From hoogendoorn.eelco at gmail.com Mon Sep 1 03:49:50 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Mon, 1 Sep 2014 09:49:50 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

Sure, I'd like to hash things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, whether it conflicts with other plans, and indeed whether we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that I'm missing, I can do without the drudgery of writing proper documentation ;).

As for whether this belongs in numpy: yes, I would say so. There is the extension of functionality to functions already in numpy, which is a no-brainer (it need not cost anything performance-wise, and I've needed unique graph edges many, many times), and there is the grouping functionality, which is the main novelty.

However, note that the grouping functionality itself is a very small addition, just a few hundred lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developer's perspective, it very much feels like a logical extension of the same 'thing'.

But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably. Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'.

On Mon, Sep 1, 2014 at 4:36 AM, Charles R Harris wrote:
> On Sun, Aug 31, 2014 at 1:48 PM, Eelco Hoogendoorn <hoogendoorn.eelco at gmail.com> wrote:
>> I've organized all code I had relating to this subject in a github repository. That should facilitate shooting around ideas. I've also added more documentation and structure to make it easier to see what is going on.
>>
>> Hopefully we can converge on a common vision, and then improve the documentation and testing to make it worthy of inclusion in the numpy master.
>>
>> Note that there is also a complete rewrite of the classic numpy.arraysetops, such that they are also generalized to more complex input, such as finding unique graph edges, and so on.
>>
>> You mentioned getting the numpy core developers involved; are they not subscribed to this mailing list? I wouldn't be surprised; you'd hope there is a channel of discussion concerning development with higher signal to noise....
>
> There are only about 2.5 of us at the moment. Those for whom this is an itch that needs scratching should hash things out and make a PR. The main question for me is if it belongs in numpy, scipy, or somewhere else.
>
> Chuck
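A minimal sketch of the kind of vectorized grouping being argued for here, using only primitives already in numpy (the names are illustrative, not the API from the repository above):

```
import numpy as np

keys   = np.array(['a', 'b', 'a', 'c', 'b', 'a'])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# np.unique labels each element with its group number; np.bincount
# then performs the per-group reduction as whole-array arithmetic
unique_keys, inverse = np.unique(keys, return_inverse=True)
group_sums   = np.bincount(inverse, weights=values)
group_counts = np.bincount(inverse)
group_means  = group_sums / group_counts

print(dict(zip(unique_keys, group_means)))
# {'a': 3.33..., 'b': 3.5, 'c': 4.0}
```

This expresses, in one vectorized pass, the same key/value association that itertools.groupby builds one element at a time.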
From emel_hasdal at hotmail.com Mon Sep 1 04:33:57 2014
From: emel_hasdal at hotmail.com (Emel Hasdal)
Date: Mon, 1 Sep 2014 01:33:57 -0700
Subject: [Numpy-discussion] How to install numpy on a box without hardware FPU
In-Reply-To: References: Message-ID:

Hello,

Is it possible to configure/install numpy on a box without a hardware FPU? When I try to install it using pip, I get a bunch of compile errors since floating-point exceptions (FE_DIVBYZERO etc.) are undefined on this platform.

How do I get numpy installed and working on such a platform?

Thanks, Emel

From charlesr.harris at gmail.com Mon Sep 1 08:05:24 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Mon, 1 Sep 2014 06:05:24 -0600
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn <hoogendoorn.eelco at gmail.com> wrote:
> Sure, I'd like to hash things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of [...]

What I'm trying to say is that numpy is a community project. We don't have a central planning committee; the only difference between "developers" and everyone else is activity and commit rights. Which is to say, if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago.

Note that the convention on the list is bottom posting.

Chuck
From njs at pobox.com Mon Sep 1 08:18:15 2014
From: njs at pobox.com (Nathaniel Smith)
Date: Mon, 1 Sep 2014 13:18:15 +0100
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Mon, Sep 1, 2014 at 8:49 AM, Eelco Hoogendoorn wrote:
> Sure, I'd like to hash things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of [...] At least from a developer's perspective, it very much feels like a logical extension of the same 'thing'.

My 2 cents:

I definitely agree that this is very useful fundamental functionality, and it would be great if numpy had a solution for it out of the box.

My main concern is that this is a fairly complicated set of functionality and there are a lot of small decisions to be made in setting up the API for it. IME it's very hard to just read through an API like this and reason out the best way to do it by pure logic; usually it needs to get banged on for a bit in real uses before it becomes clear what the right set of trade-offs is. And numpy itself is not a great environment for these kinds of iterations.

So, IMO the main challenge is: how do we get the functionality into a state where we can convince ourselves that it'll be supportable in numpy indefinitely, and not need to be replaced in a year or two? Some things that might help with this convincing:
- releasing it as a small standalone package on pypi and getting some real users to bang on it
- any real code written against the APIs
- feedback from the pandas community since they've spent a lot of time working on these issues
- ...?

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
From hoogendoorn.eelco at gmail.com Mon Sep 1 09:58:57 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Mon, 1 Sep 2014 15:58:57 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:
> What I'm trying to say is that numpy is a community project. [...] Which is to say, if you develop and push this topic it is likely to go in. [...] Note that the convention on the list is bottom posting.

I understand that numpy is a community project, so that the decision isn't up to any one particular person; but some early-stage feedback from those active in the community would be welcome. I am generally confident that this addition makes sense, but I have not contributed to numpy before, and you don't know what you don't know, and all... Given that there are multiple suggestions for changing arraysetops, some coordination would be useful, I think.

Note that I use graph edges merely as an example; the proposed functionality is much more general than graphing algorithms specifically. The radial reduction example I included on github is particularly illustrative of the general utility of grouping functionality, I think. Operations like radial reductions are rather common, and a custom implementation is quite lengthy, very bug prone, and potentially very slow.

Thanks for the heads up on posting convention; I've always let gmail do my thinking for me, which works well enough for me, but I can see how not following this convention is annoying to others.
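The radial reduction mentioned above is a compact example of a grouping reduction; a rough sketch using np.bincount as the grouping primitive (an editorial illustration, not the implementation from the repository):

```
import numpy as np

def radial_mean(image, center):
    # integer radius of every pixel relative to the center
    y, x = np.indices(image.shape)
    r = np.hypot(y - center[0], x - center[1]).astype(int)
    # group pixel values by radius: per-bin sums divided by per-bin counts
    sums = np.bincount(r.ravel(), weights=image.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)  # guard against empty radius bins

image = np.random.rand(64, 64)
profile = radial_mean(image, (32, 32))  # mean intensity per integer radius
```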
From kgabor79 at gmail.com Mon Sep 1 11:23:20 2014
From: kgabor79 at gmail.com (Gabor Kovacs)
Date: Mon, 1 Sep 2014 16:23:20 +0100
Subject: [Numpy-discussion] ENH IncrementalWriter for .npy files
Message-ID:

Dear All,

I would like to add a class for writing one (possibly big) .npy file, saving multiple (same dtype, compatible shape) arrays. My use case was saving slowly accumulating data regularly, over a long time, into one file.

Please find a first implementation under https://github.com/numpy/numpy/pull/4987 . It currently supports only writing a new file, and only in C order in the file. Opening an existing file for append and reading back parts from a very big .npy file would be straightforward next steps for a full-featured class. The .npy file format is only affected by leaving some extra space for re-writing the header later with a possibly bigger "shape" field, respecting the 16-byte alignment.

Example:
```
A = np.array([[0,1,2,3,4,5,6,7],[8,9,10,11,12,13,14,15]])
with np.IncrementalWriter("testfile.npy", hdrupdate=True, flush=True) as W:
    W.save(A)
    W.save(A)
```

Feel free to comment on this idea.

Cheers,
Gabor

From jtaylor.debian at googlemail.com Mon Sep 1 17:45:02 2014
From: jtaylor.debian at googlemail.com (Julian Taylor)
Date: Mon, 01 Sep 2014 23:45:02 +0200
Subject: [Numpy-discussion] How to install numpy on a box without hardware FPU
In-Reply-To: References: Message-ID: <5404E8DE.10000@googlemail.com>

On 01.09.2014 10:33, Emel Hasdal wrote:
> Is it possible to configure/install numpy on a box without a hardware FPU? When I try to install it using pip, I get a bunch of compile errors since floating-point exceptions (FE_DIVBYZERO etc.) are undefined on this platform.
>
> How do I get numpy installed and working on such a platform?

If it's just that, you can try replacing all the fenv stuff with stubs doing nothing. You only lose some runtime warnings about special cases.

Why do you want to run numpy on such a system? Numpy is not really intended to run on such devices. But it is possible: the debian armel port is, as far as I know, softfloat, and numpy seems to be running fine there, though it probably also emulates fenv.
From emel_hasdal at hotmail.com Tue Sep 2 03:29:15 2014
From: emel_hasdal at hotmail.com (Emel Hasdal)
Date: Tue, 2 Sep 2014 00:29:15 -0700
Subject: [Numpy-discussion] How to install numpy on a box without hardware FPU
In-Reply-To: <5404E8DE.10000@googlemail.com> References: <5404E8DE.10000@googlemail.com> Message-ID:

I am trying to run a python application which performs statistical calculations using Pandas, which seems to depend on Numpy. Hence I have to install Numpy to get the app working.

Do you mean I can change numpy/core/src/npymath/ieee754.c.src such that the functions referencing exceptions (npy_get_floatstatus, npy_clear_floatstatus, npy_set_floatstatus_divbyzero, npy_set_floatstatus_overflow, npy_set_floatstatus_underflow, npy_set_floatstatus_invalid) do nothing?

Could there be any implications of this on the numpy functionality?

Thanks, Emel

> Date: Mon, 1 Sep 2014 23:45:02 +0200
> From: jtaylor.debian at googlemail.com
> If it's just that, you can try replacing all the fenv stuff with stubs doing nothing. You only lose some runtime warnings about special cases. [...]

From Jerome.Kieffer at esrf.fr Tue Sep 2 03:32:29 2014
From: Jerome.Kieffer at esrf.fr (Jerome Kieffer)
Date: Tue, 2 Sep 2014 09:32:29 +0200
Subject: [Numpy-discussion] ENH IncrementalWriter for .npy files
In-Reply-To: References: Message-ID: <20140902093229.da5a20b7827711cab08e287e@esrf.fr>

Hi,

This feature is very similar to what is available in hdf5 and exposed under h5py using chunks and max_size ...

Cheers,

--
Jérôme Kieffer
tel +33 476 882 445
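For reference, the h5py pattern being pointed at looks roughly like this (the actual keyword there is `maxshape`; a chunked dataset left unbounded along its first axis can be grown block by block):

```
import h5py
import numpy as np

A = np.array([[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]])

with h5py.File("testfile.h5", "w") as f:
    dset = f.create_dataset("data", shape=(0, 8), maxshape=(None, 8),
                            chunks=(2, 8), dtype=A.dtype)
    for block in (A, A):
        # extend the dataset, then write the new block into the tail
        dset.resize(dset.shape[0] + block.shape[0], axis=0)
        dset[-block.shape[0]:] = block
```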
From cournape at gmail.com Tue Sep 2 04:00:37 2014
From: cournape at gmail.com (David Cournapeau)
Date: Tue, 2 Sep 2014 09:00:37 +0100
Subject: [Numpy-discussion] How to install numpy on a box without hardware FPU
In-Reply-To: References: <5404E8DE.10000@googlemail.com> Message-ID:

On Tue, Sep 2, 2014 at 8:29 AM, Emel Hasdal <emel_hasdal at hotmail.com> wrote:
> Do you mean I can change numpy/core/src/npymath/ieee754.c.src such that the functions referencing exceptions [...] do nothing? Could there be any implications of this on the numpy functionality?

AFAIK, few people have ever tried to run numpy on a CPU without an FPU. The generic answer is that we do not know, as this is not a supported platform, so you are on your own. I would suggest you just try adding stubs to see how far you can go, and come back to this group with the result of your investigation.

David

From emel_hasdal at hotmail.com Tue Sep 2 04:38:49 2014
From: emel_hasdal at hotmail.com (Emel Hasdal)
Date: Tue, 2 Sep 2014 01:38:49 -0700
Subject: [Numpy-discussion] How to install numpy on a box without hardware FPU
In-Reply-To: References: <5404E8DE.10000@googlemail.com> Message-ID:

Sure. I would be happy to try it and report back. Just to make sure I understand it: I will try defining the following functions for my platform such that they do nothing: npy_get_floatstatus, npy_clear_floatstatus, npy_set_floatstatus_divbyzero, npy_set_floatstatus_overflow, npy_set_floatstatus_underflow, npy_set_floatstatus_invalid

Thanks, Emel

From charlesr.harris at gmail.com Tue Sep 2 20:40:09 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Tue, 2 Sep 2014 18:40:09 -0600
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn <hoogendoorn.eelco at gmail.com> wrote:
> I understand that numpy is a community project, so that the decision isn't up to any one particular person; but some early-stage feedback from those active in the community would be welcome. [...]

What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient.

Chuck
From jaime.frio at gmail.com Tue Sep 2 22:07:52 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Tue, 2 Sep 2014 19:07:52 -0700
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:
> What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient.

Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work than merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months.

From a setops point of view, merging two sorted arrays makes it very straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others.

I have also been thinking lately that one of the problems with all these unique-related computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table...

Jaime

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
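Merging two already-sorted arrays, as discussed above, can be expressed with existing primitives; a minimal sketch (an editorial illustration, not the code from the proposed PR):

```
import numpy as np

def merge_sorted(a, b):
    # positions the (sorted) elements of b will occupy in the merged output
    idx = np.searchsorted(a, b) + np.arange(len(b))
    out = np.empty(len(a) + len(b), dtype=np.result_type(a, b))
    mask = np.zeros(len(out), dtype=bool)
    mask[idx] = True
    out[mask] = b
    out[~mask] = a
    return out

print(merge_sorted(np.array([1, 3, 5]), np.array([2, 3, 6])))
# [1 2 3 3 5 6]
```

A dedicated compiled merge could do this in one linear pass; the sketch just shows the semantics, and the `idx` array is exactly the kind of index byproduct that makes set operations like `in1d` cheap.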
From cjw at ncf.ca Wed Sep 3 08:26:02 2014
From: cjw at ncf.ca (cjw)
Date: Wed, 03 Sep 2014 08:26:02 -0400
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID: <540708DA.7070604@ncf.ca>

On 02/09/2014 8:40 PM, Charles R Harris wrote:
> [...]

These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the post-numarray period, Travis Oliphant took that role and advanced the package in many ways.

It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision.

Colin W.

From hoogendoorn.eelco at gmail.com Wed Sep 3 09:41:51 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Wed, 3 Sep 2014 15:41:51 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río <jaime.frio at gmail.com> wrote:
> I have also been thinking lately that one of the problems with all these unique-related computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table...

Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but it's questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time constant of hashing is higher. But more importantly, using a hash table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. Also, getting the unique values/keys in sorted order is a nice side benefit for many applications.
From jaime.frio at gmail.com Wed Sep 3 12:33:02 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Wed, 3 Sep 2014 09:33:02 -0700
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <hoogendoorn.eelco at gmail.com> wrote:
> Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but it's questionable if this leads to improved performance over performing an argsort. [...]

If you want to give it a try, this branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:

https://github.com/jaimefrio/numpy/tree/hash-unique

A use case where the hash table is clearly better:

In [1]: import numpy as np
In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,))
In [4]: %timeit np.unique(a)
1000 loops, best of 3: 258 us per loop
In [5]: %timeit _unique(a)
10000 loops, best of 3: 143 us per loop
In [6]: %timeit np.sort(_unique(a))
10000 loops, best of 3: 149 us per loop

It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type-specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...

If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse:

In [8]: a = np.random.randint(10000, size=(5000,))
In [9]: %timeit np.unique(a)
1000 loops, best of 3: 277 us per loop
In [10]: %timeit np.sort(_unique(a))
1000 loops, best of 3: 320 us per loop

But the hash table still wins in extracting the unique items only:

In [11]: %timeit _unique(a)
10000 loops, best of 3: 187 us per loop

Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won't attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement.

The speed-up for in1d is also nice:

In [16]: a = np.random.randint(1000, size=(1000,))
In [17]: b = np.random.randint(1000, size=(500,))
In [18]: %timeit np.in1d(a, b)
1000 loops, best of 3: 178 us per loop
In [19]: %timeit _in1d(a, b)
10000 loops, best of 3: 30.1 us per loop

Of course, there is no point in
From jaime.frio at gmail.com Wed Sep 3 12:46:16 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Wed, 3 Sep 2014 09:46:16 -0700
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <jaime.frio at gmail.com> wrote:
> [...]
> Of course, there is no point in

Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I know I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:

http://pandas.pydata.org/pandas-docs/stable/groupby.html

Does numpy need to replicate this? What/why/how can we add to that?

Jaime

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
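For reference, the pandas idiom linked above looks roughly like this (a small illustrative session; the keys and values are invented):

```
import numpy as np
import pandas as pd

keys   = np.array(['a', 'b', 'a', 'b', 'a'])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# pandas' take on the grouping reduction discussed in this thread
print(pd.Series(values).groupby(keys).mean())
# a    3.0
# b    3.0
```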
From jaime.frio at gmail.com Wed Sep 3 12:53:12 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Wed, 3 Sep 2014 09:53:12 -0700
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: <540708DA.7070604@ncf.ca> References: <540708DA.7070604@ncf.ca> Message-ID:

On Wed, Sep 3, 2014 at 5:26 AM, cjw wrote:
> These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. [...] There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision.

We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :-)

Jaime

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
From charlesr.harris at gmail.com Wed Sep 3 13:25:45 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 3 Sep 2014 11:25:45 -0600
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: <540708DA.7070604@ncf.ca> Message-ID:

On Wed, Sep 3, 2014 at 10:53 AM, Jaime Fernández del Río <jaime.frio at gmail.com> wrote:
> We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :-)

Charles I needs an heir in case of execution by the Parliamentary forces.

Chuck

From alan.isaac at gmail.com Wed Sep 3 17:19:56 2014
From: alan.isaac at gmail.com (Alan G Isaac)
Date: Wed, 03 Sep 2014 17:19:56 -0400
Subject: [Numpy-discussion] odd (?) behavior: negative integer scalar in exponent
Message-ID: <540785FC.2090903@gmail.com>

What should be the value of `2**np.int_(-32)`? It is apparently currently computed as `1. / (2**np.int_(32))`, so the computation overflows (when a C long is 32 bits). I would have hoped for it to be computed as `1./(2.**np.int_(32))`.

Cheers,
Alan Isaac

From charlesr.harris at gmail.com Wed Sep 3 17:47:27 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 3 Sep 2014 15:47:27 -0600
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
Message-ID:

Hi All,

I'd like to give Jaime commit rights. Having at least three active developers with commit rights is the goal, and Jaime has been pretty consistent with code submissions and discussion participation. Thoughts?

Chuck

From robert.kern at gmail.com Wed Sep 3 17:48:45 2014
From: robert.kern at gmail.com (Robert Kern)
Date: Wed, 3 Sep 2014 22:48:45 +0100
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris wrote:
> I'd like to give Jaime commit rights. Having at least three active developers with commit rights is the goal and Jaime has been pretty consistent with code submissions and discussion participation.

+1

--
Robert Kern

From ralf.gommers at gmail.com Wed Sep 3 18:42:47 2014
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Thu, 4 Sep 2014 00:42:47 +0200
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 11:48 PM, Robert Kern <robert.kern at gmail.com> wrote:
> +1

+1 excellent idea

Ralf

From charlesr.harris at gmail.com Wed Sep 3 19:25:03 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 3 Sep 2014 17:25:03 -0600
Subject: [Numpy-discussion] odd (?) behavior: negative integer scalar in exponent
In-Reply-To: <540785FC.2090903@gmail.com> References: <540785FC.2090903@gmail.com> Message-ID:

On Wed, Sep 3, 2014 at 3:19 PM, Alan G Isaac <alan.isaac at gmail.com> wrote:
> What should be the value of `2**np.int_(-32)`? It is apparently currently computed as `1. / (2**np.int_(32))`, so the computation overflows (when a C long is 32 bits). I would have hoped for it to be computed as `1./(2.**np.int_(32))`.

Looks like a bug to me.

Chuck.
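A small illustration of the report and the explicit floating-point workaround (a sketch; the exact outcome of the integer version depends on platform and numpy version, and later numpy releases reject integers raised to negative integer powers outright):

```
import numpy as np

exp = np.int_(-32)

# 2 ** exp goes through integer power: with a 32-bit C long the
# intermediate 2**32 wraps around, poisoning the reciprocal.
# Forcing the computation into floating point avoids the overflow:
print(2.0 ** exp)                  # 2.3283064365386963e-10
print(1.0 / np.float64(2) ** 32)   # same value, i.e. 2**-32
```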
From hoogendoorn.eelco at gmail.com Wed Sep 3 19:48:10 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Thu, 4 Sep 2014 01:48:10 +0200
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To: References: Message-ID:

+1; though I am relatively new to the scene, Jaime's contributions have always stood out to me as thoughtful.

On Thu, Sep 4, 2014 at 12:42 AM, Ralf Gommers <ralf.gommers at gmail.com> wrote:
> +1 excellent idea

From charlesr.harris at gmail.com Wed Sep 3 20:47:41 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 3 Sep 2014 18:47:41 -0600
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 5:48 PM, Eelco Hoogendoorn <hoogendoorn.eelco at gmail.com> wrote:
> +1; though I am relatively new to the scene, Jaime's contributions have always stood out to me as thoughtful.

I think the ayes will have it.

Chuck

From jaime.frio at gmail.com Wed Sep 3 21:34:13 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Wed, 3 Sep 2014 18:34:13 -0700
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 5:47 PM, Charles R Harris wrote:
> I think the ayes will have it.

As I told Chuck (because I now get to call Charles Chuck, right? :-)), I am not sure I am fully qualified for the job: looking at the names on that list is a humbling experience. But even if I am the idiot brother, it feels good to be part of the family.

Numpy has provided me with countless hours of learning and enjoyment, and I really look forward to giving back, even if only a fraction of that.

Thanks a lot for the trust!

Jaime

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
From sebastian at sipsolutions.net Thu Sep 4 03:46:50 2014
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Thu, 04 Sep 2014 09:46:50 +0200
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To: References: Message-ID: <1409816810.9142.2.camel@sebastian-t440>

On Mi, 2014-09-03 at 18:47 -0600, Charles R Harris wrote:
> [...]
> I think the ayes will have it.

Aye.
From hoogendoorn.eelco at gmail.com Thu Sep 4 04:31:01 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Thu, 4 Sep 2014 10:31:01 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To: References: Message-ID:

On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <jaime.frio at gmail.com> wrote:
> [...] I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. [...] has anyone looked at how pandas implements grouping? [...] Does numpy need to replicate this? What/why/how can we add to that?

I would certainly not be opposed to having a hashing based indexing mechanism; I think it would make sense design-wise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal, I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.

Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.

From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge. I personally use grouping a lot in my code, and I don't like having to use pandas for it.
Most importantly, I don't want to go around creating a dataframe for a single one-line hit-and-run association between keys and values. The permanent association of different types of data and their metadata which pandas offers is, I think, the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulation of ndarrays, and adding calls to DataFrame or Series in between a statement that could simply be group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.

In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because, unlike the currently implemented features, the need has never arisen for me. I'm only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
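A sketch of what a group_by(keys).mean(values) interface can look like on top of a sorting-based index (illustrative only; for the actual interface see the repository linked earlier in the thread):

```
import numpy as np

class group_by:
    """Hypothetical sketch of the proposed interface."""
    def __init__(self, keys):
        self.order = np.argsort(keys, kind='mergesort')   # stable sort
        sorted_keys = keys[self.order]
        # start offset of each run of equal keys
        self.starts = np.flatnonzero(np.r_[True, sorted_keys[1:] != sorted_keys[:-1]])
        self.unique = sorted_keys[self.starts]
        self.counts = np.diff(np.r_[self.starts, len(keys)])

    def sum(self, values):
        return self.unique, np.add.reduceat(values[self.order], self.starts)

    def mean(self, values):
        unique, sums = self.sum(values)
        return unique, sums / self.counts

keys = np.array([0, 1, 0, 2, 1])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(group_by(keys).mean(values))
# (array([0, 1, 2]), array([2. , 3.5, 4. ]))
```

One sort builds the index; every reduction thereafter is a contiguous, cache-friendly reduceat pass, which is the design trade-off argued for above.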
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ndbecker2 at gmail.com  Thu Sep  4 07:32:51 2014
From: ndbecker2 at gmail.com (Neal Becker)
Date: Thu, 04 Sep 2014 07:32:51 -0400
Subject: [Numpy-discussion] SFMT (faster mersenne twister)
Message-ID:

http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html

-- 
Those who don't understand recursion are doomed to repeat it

From ben.root at ou.edu  Thu Sep  4 09:16:47 2014
From: ben.root at ou.edu (Benjamin Root)
Date: Thu, 4 Sep 2014 09:16:47 -0400
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To:
References:
Message-ID:

Jaime,

I had the same feeling when John Hunter gave me commit rights to
matplotlib. I later asked him about it, and he said that he gives commit
rights to those who annoy the mailing list the most. So, it might not be
*that* humbling... :-P

Cheers!
Ben Root

On Wed, Sep 3, 2014 at 9:34 PM, Jaime Fernández del Río <
jaime.frio at gmail.com> wrote:

> On Wed, Sep 3, 2014 at 5:47 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>> I think the ayes will have it.
>>
> As I told Chuck (because I now get to call Charles Chuck, right? :-)), I
> am not sure I am fully qualified for the job: looking at the names on
> that list is a humbling experience. But even if I am the idiot brother,
> it feels good to be part of the family.
>
> Numpy has provided me with countless hours of learning and enjoyment, and
> I really look forward to giving back, even if only a fraction of that.
>
> Thanks a lot for the trust!
>
> Jaime
>
> --
> (\__/)
> ( O.o)
> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
> de dominación mundial.
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charlesr.harris at gmail.com  Thu Sep  4 10:43:01 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Thu, 4 Sep 2014 08:43:01 -0600
Subject: [Numpy-discussion] Give Jaime Fernandez commit rights.
In-Reply-To:
References:
Message-ID:

On Wed, Sep 3, 2014 at 7:34 PM, Jaime Fernández del Río <
jaime.frio at gmail.com> wrote:

> <snip>
>
> Thanks a lot for the trust!
>
One thing you might want to check is if you have a strong password.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hoogendoorn.eelco at gmail.com  Thu Sep  4 13:39:14 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Thu, 4 Sep 2014 19:39:14 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To:
References:
Message-ID:

On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> <snip>
>
Oh I forgot to add: with an indexing mechanism based on sorting, unique
values and counts also come 'for free', not counting the O(N) cost of
actually creating those arrays. The only time an operation relying on an
index incurs another nontrivial amount of overhead is in case its 'rank'
or 'inverse' property is used, which invokes another argsort. But for the
vast majority of grouping or set operations, these properties are never
used.
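To spell out the 'for free' part, a minimal sketch (with made-up variable
names) of how the uniques and counts drop out of the sorted keys in one
vectorized comparison:

import numpy as np

keys = np.array([3, 1, 3, 2, 1, 3])
sorter = np.argsort(keys, kind='mergesort')          # the only expensive step
sk = keys[sorter]                                    # sorted keys
flag = np.concatenate(([True], sk[1:] != sk[:-1]))   # True at each group start
unique = sk[flag]                                    # array([1, 2, 3])
counts = np.diff(np.flatnonzero(np.concatenate((flag, [True]))))  # array([2, 1, 3])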
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jaime.frio at gmail.com  Thu Sep  4 13:55:18 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Thu, 4 Sep 2014 10:55:18 -0700
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To:
References:
Message-ID:

On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> <snip> The only time an operation relying on an index incurs another
> nontrivial amount of overhead is in case its 'rank' or 'inverse' property
> is used, which invokes another argsort. But for the vast majority of
> grouping or set operations, these properties are never used.
>

That extra argsort is now gone from master:

https://github.com/numpy/numpy/pull/5012

Even with this improvement, returning any index typically makes
`np.unique` run at least 2x slower:

In [1]: import numpy as np
In [2]: a = np.random.randint(100, size=(1000,))
In [3]: %timeit np.unique(a)
10000 loops, best of 3: 37.3 us per loop
In [4]: %timeit np.unique(a, return_inverse=True)
10000 loops, best of 3: 62.1 us per loop
In [5]: %timeit np.unique(a, return_index=True)
10000 loops, best of 3: 72.8 us per loop
In [6]: %timeit np.unique(a, return_counts=True)
10000 loops, best of 3: 56.4 us per loop
In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
10000 loops, best of 3: 112 us per loop

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hoogendoorn.eelco at gmail.com  Thu Sep  4 14:14:22 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Thu, 4 Sep 2014 20:14:22 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To:
References:
Message-ID:

I should clarify: I am speaking about my implementation; I haven't looked
at the numpy implementation for a while, so I'm not sure what it is up to.
Note that by 'almost free', we are still talking about three passes over
the whole array plus temp allocations, but I am assuming a use-case where
the various sorts involved are the dominant cost, which I imagine they
are, for out-of-cache sorts. Perhaps this isn't too realistic an
assumption about the average use case though, I don't know. Though I
suppose it's a reasonable guideline to assume that either the dataset is
big, or performance isn't that big a concern in the first place.

On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <
jaime.frio at gmail.com> wrote:

> That extra argsort is now gone from master:
>
> https://github.com/numpy/numpy/pull/5012
>
> <snip>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hoogendoorn.eelco at gmail.com  Thu Sep  4 14:29:07 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Thu, 4 Sep 2014 20:29:07 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To:
References:
Message-ID:

On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> I should clarify: I am speaking about my implementation; I haven't looked
> at the numpy implementation for a while, so I'm not sure what it is up to.
> <snip>
>
Yeah, I looked at the numpy implementation, and it seems these speed
differences are simply the result of the extra O(N) costs involved, so my
implementation would have the same characteristics. If another array copy
or two has meaningful impact on performance, then you are done as far as
optimization within numpy is concerned, I'd say. You could fuse those
loops on the C level, but as you said, I think it's better to think about
these kinds of optimizations once we have a more complete picture of the
functionality we want.

Good call on removing the extra argsort; that hadn't occurred to me.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jeffreback at gmail.com  Thu Sep  4 14:36:58 2014
From: jeffreback at gmail.com (Jeff Reback)
Date: Thu, 4 Sep 2014 14:36:58 -0400
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To:
References:
Message-ID:

FYI pandas DOES use a very performant hash table impl for unique (and
value_counts). Sorted state IS maintained by the underlying Index
implementation.
https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx

In [8]: a = np.random.randint(10, size=(10000,))

In [9]: %timeit np.unique(a)
1000 loops, best of 3: 284 µs per loop

In [10]: %timeit Series(a).unique()
10000 loops, best of 3: 161 µs per loop

In [11]: s = Series(a)

# without the creation overhead
In [12]: %timeit s.unique()
10000 loops, best of 3: 75.3 µs per loop

On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> <snip>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hoogendoorn.eelco at gmail.com  Thu Sep  4 15:19:14 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Thu, 4 Sep 2014 21:19:14 +0200
Subject: [Numpy-discussion] Does a `mergesorted` function make sense?
In-Reply-To:
References:
Message-ID:

Naturally, you'd want to avoid redoing the indexing where you can, which
is another good reason to factor out the indexing mechanisms into separate
classes. A factor two performance difference does not get me too excited;
again, I think it would be the other way around for an out-of-cache
dataset being grouped. But this by itself is of course another argument
for factoring out the indexing behind a uniform interface, so we can play
around with those implementation details later, and specialize the
indexing to serve different scenarios. Also, it really helps with code
maintainability; most arraysetops are almost trivial to implement once you
have abstracted away the indexing machinery.
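As an illustration of the kind of uniform interface I mean (a sketch only;
SortingIndex, HashIndex and their attributes are placeholder names, not
the actual classes in my repository), any object exposing the same
attributes can back the arraysetops, whether it is built on a sort or on a
hash table:

import numpy as np

class SortingIndex(object):
    # index backed by a stable argsort; derived arrays are computed once
    def __init__(self, keys):
        self.sorter = np.argsort(keys, kind='mergesort')
        sk = keys[self.sorter]
        flag = np.concatenate(([True], sk[1:] != sk[:-1]))
        self.start = np.flatnonzero(flag)      # first index of each group
        self.unique = sk[self.start]           # sorted by construction
        self.counts = np.diff(np.concatenate((self.start, [sk.size])))

class HashIndex(object):
    # same attributes, backed by a hash table (a plain dict, for the sketch);
    # note the uniques come out unsorted here
    def __init__(self, keys):
        table = {}
        for k in keys.tolist():
            table[k] = table.get(k, 0) + 1
        self.unique = np.array(list(table.keys()))
        self.counts = np.array(list(table.values()))

def unique_with_counts(keys, index=SortingIndex):
    # an arraysetop reduces to a one-liner over whichever index we pick
    idx = index(keys)
    return idx.unique, idx.counts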
But more importantly, using a hash-table guarantees poor cache >>>>>>>>> behavior for many operations using this index. By contrast, sorting may >>>>>>>>> (but need not) make one random access pass to build the index, and may (but >>>>>>>>> need not) perform one random access to reorder values for grouping. But >>>>>>>>> insofar as the keys are better behaved than pure random, this coherence >>>>>>>>> will be exploited. >>>>>>>>> >>>>>>>> >>>>>>>> If you want to give it a try, these branch of my numpy fork has >>>>>>>> hash table based implementations of unique (with no extra indices) and in1d: >>>>>>>> >>>>>>>> https://github.com/jaimefrio/numpy/tree/hash-unique >>>>>>>> >>>>>>>> A use cases where the hash table is clearly better: >>>>>>>> >>>>>>>> In [1]: import numpy as np >>>>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d >>>>>>>> >>>>>>>> In [3]: a = np.random.randint(10, size=(10000,)) >>>>>>>> In [4]: %timeit np.unique(a) >>>>>>>> 1000 loops, best of 3: 258 us per loop >>>>>>>> In [5]: %timeit _unique(a) >>>>>>>> 10000 loops, best of 3: 143 us per loop >>>>>>>> In [6]: %timeit np.sort(_unique(a)) >>>>>>>> 10000 loops, best of 3: 149 us per loop >>>>>>>> >>>>>>>> It typically performs between 1.5x and 4x faster than sorting. I >>>>>>>> haven't profiled it properly to know, but there may be quite a bit of >>>>>>>> performance to dig out: have type specific comparison functions, optimize >>>>>>>> the starting hash table size based on the size of the array to avoid >>>>>>>> reinsertions... >>>>>>>> >>>>>>>> If getting the elements sorted is a necessity, and the array >>>>>>>> contains very few or no repeated items, then the hash table approach may >>>>>>>> even perform worse,: >>>>>>>> >>>>>>>> In [8]: a = np.random.randint(10000, size=(5000,)) >>>>>>>> In [9]: %timeit np.unique(a) >>>>>>>> 1000 loops, best of 3: 277 us per loop >>>>>>>> In [10]: %timeit np.sort(_unique(a)) >>>>>>>> 1000 loops, best of 3: 320 us per loop >>>>>>>> >>>>>>>> But the hash table still wins in extracting the unique items only: >>>>>>>> >>>>>>>> In [11]: %timeit _unique(a) >>>>>>>> 10000 loops, best of 3: 187 us per loop >>>>>>>> >>>>>>>> Where the hash table shines is in more elaborate situations. If you >>>>>>>> keep the first index where it was found, and the number of repeats, in the >>>>>>>> hash table, you can get return_index and return_counts almost for free, >>>>>>>> which means you are performing an extra 3x faster than with sorting. >>>>>>>> return_inverse requires a little more trickery, so I won;t attempt to >>>>>>>> quantify the improvement. But I wouldn't be surprised if, after fine tuning >>>>>>>> it, there is close to an order of magnitude overall improvement >>>>>>>> >>>>>>>> The spped-up for in1d is also nice: >>>>>>>> >>>>>>>> In [16]: a = np.random.randint(1000, size=(1000,)) >>>>>>>> In [17]: b = np.random.randint(1000, size=(500,)) >>>>>>>> In [18]: %timeit np.in1d(a, b) >>>>>>>> 1000 loops, best of 3: 178 us per loop >>>>>>>> In [19]: %timeit _in1d(a, b) >>>>>>>> 10000 loops, best of 3: 30.1 us per loop >>>>>>>> >>>>>>>> Of course, there is no point in >>>>>>>> >>>>>>> >>>>>>> Ooops!!! Hit the send button too quick. Not to extend myself too >>>>>>> long: if we are going to rethink all of this, we should approach it with an >>>>>>> open mind. Still, and this post is not helping with that either, I am >>>>>>> afraid that we are discussing implementation details, but are missing a >>>>>>> broader vision of what we want to accomplish and why. 
That vision of what >>>>>>> numpy's grouping functionality, if any, should be, and how it complements >>>>>>> or conflicts with what pandas is providing, should precede anything else. I >>>>>>> know I haven't, but has anyone looked at how pandas implements grouping? >>>>>>> Their documentation on the subject is well worth a read: >>>>>>> >>>>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html >>>>>>> >>>>>>> Does numpy need to replicate this? What/why/how can we add to that? >>>>>>> >>>>>>> Jaime >>>>>>> >>>>>>> -- >>>>>>> (\__/) >>>>>>> ( O.o) >>>>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus >>>>>>> planes de dominación mundial. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> NumPy-Discussion mailing list >>>>>>> NumPy-Discussion at scipy.org >>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>>>>> >>>>>>> I would certainly not be opposed to having a hashing based indexing >>>>>> mechanism; I think it would make sense design-wise to have a HashIndex >>>>>> class with the same interface as the rest, and use that subclass in those >>>>>> arraysetops where it makes sense. The 'how to' of indexing and its >>>>>> applications are largely orthogonal I think (with some tiny performance >>>>>> compromises which are worth the abstraction imo). For datasets which are >>>>>> not purely random, have many unique items, and which do not fit into cache, >>>>>> I would expect sorting to come out on top, but indeed it depends on the >>>>>> dataset. >>>>>> >>>>>> Yeah, the question of how pandas does grouping, and whether we can do >>>>>> better, is a relevant one. >>>>>> >>>>>> From what I understand, pandas relies on cython extensions to get >>>>>> vectorized grouping functionality. This is no longer necessary since the >>>>>> introduction of ufuncs in numpy. I don't know how the implementations >>>>>> compare in terms of performance, but I doubt the difference is huge. >>>>>> >>>>>> I personally use grouping a lot in my code, and I don't like >>>>>> having to use pandas for it. Most importantly, I don't want to go around >>>>>> creating a dataframe for a single one-line hit-and-run association between >>>>>> keys and values. The permanent association of different types of data and >>>>>> their metadata which pandas offers is I think the key difference from >>>>>> numpy, which is all about manipulating just plain ndarrays. Arguably, >>>>>> grouping itself is a pretty elementary manipulation of ndarrays, and adding >>>>>> calls to DataFrame or Series in the middle of a statement that could >>>>>> simply be group_by(keys).mean(values) feels wrong to me. As does including >>>>>> pandas as a dependency just to use this small piece of >>>>>> functionality. Grouping is a more general functionality than any particular >>>>>> method of organizing your data. >>>>>> >>>>>> In terms of features, adding transformations and filtering might be >>>>>> nice too; I hadn't thought about it, but that is because unlike the >>>>>> currently implemented features, the need has never arisen for me. I'm only a >>>>>> small sample size, and I don't see any fundamental objection to adding such >>>>>> functionality though. It certainly raises the question as to where to draw >>>>>> the line with pandas; but my rule of thumb is that if you can think of it >>>>>> as an elementary operation on ndarrays, then it probably belongs in numpy.
>>>>>> >> >>>>> Oh I forgot to add: with an indexing mechanism based on sorting, >>>>> unique values and counts also come 'for free', not counting the O(N) cost >>>>> of actually creating those arrays. The only time an operation relying on an >>>>> index incurs another nontrivial amount of overhead is in case its 'rank' or >>>>> 'inverse' property is used, which invokes another argsort. But for the vast >>>>> majority of grouping or set operations, these properties are never used. >>>>> >>>> >>>> That extra argsort is now gone from master: >>>> >>>> https://github.com/numpy/numpy/pull/5012 >>>> >>>> Even with this improvement, returning any index typically makes >>>> `np.unique` run at least 2x slower: >>>> >>>> In [1]: import numpy as np >>>> In [2]: a = np.random.randint(100, size=(1000,)) >>>> In [3]: %timeit np.unique(a) >>>> 10000 loops, best of 3: 37.3 us per loop >>>> In [4]: %timeit np.unique(a, return_inverse=True) >>>> 10000 loops, best of 3: 62.1 us per loop >>>> In [5]: %timeit np.unique(a, return_index=True) >>>> 10000 loops, best of 3: 72.8 us per loop >>>> In [6]: %timeit np.unique(a, return_counts=True) >>>> 10000 loops, best of 3: 56.4 us per loop >>>> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, >>>> return_counts=True) >>>> 10000 loops, best of 3: 112 us per loop >>>> >>>> -- >>>> (\__/) >>>> ( O.o) >>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus >>>> planes de dominación mundial. >>>> >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion at scipy.org >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>> >>>> >>> >> Yeah, I looked at the numpy implementation, and it seems these speed >> differences are simply the result of the extra O(N) costs involved, so my >> implementation would have the same characteristics. If another array copy >> or two has meaningful impact on performance, then you are done as far as >> optimization within numpy is concerned, I'd say. You could fuse those loops >> on the C level, but as you said I think it's better to think about these >> kinds of optimizations once we have a more complete picture of the >> functionality we want. >> >> Good call on removing the extra argsort, that hadn't occurred to me. >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Sep 4 16:40:02 2014 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 4 Sep 2014 21:40:02 +0100 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: On Thu, Sep 4, 2014 at 12:32 PM, Neal Becker wrote: > http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html What would you like to say about it? -- Robert Kern From joseph.martinot-lagarde at m4x.org Thu Sep 4 20:47:28 2014 From: joseph.martinot-lagarde at m4x.org (Joseph Martinot-Lagarde) Date: Fri, 05 Sep 2014 02:47:28 +0200 Subject: [Numpy-discussion] 'norm' keyword for FFT functions Message-ID: I have an old PR [1] to fix #2142 [2].
The idea is to have a new keyword for all fft functions to define the normalisation of the fft: - if 'norm' is None (the default), the normalisation is the current one: fft() is not normalized and ifft is normalized by 1/n. - if norm is "ortho", the direct and inverse transforms are both normalized by 1/sqrt(n). The results are then unitary. The keyword name and value are consistent with scipy.fftpack.dct. Do you feel that it should be merged? Joseph [1] https://github.com/numpy/numpy/pull/3883 [2] https://github.com/numpy/numpy/issues/2142 From joseph.martinot-lagarde at m4x.org Thu Sep 4 20:46:51 2014 From: joseph.martinot-lagarde at m4x.org (Joseph Martinot-Lagarde) Date: Fri, 05 Sep 2014 02:46:51 +0200 Subject: [Numpy-discussion] Multiple comment tokens for loadtxt Message-ID: loadtxt currently has a keyword to change the comment token. The PR #4612 [1] makes it possible to define multiple comment tokens for a file. It is motivated by #2633 [2]. What is your position on this one? Joseph [1] https://github.com/numpy/numpy/pull/4612 [2] https://github.com/numpy/numpy/issues/2633 From ndbecker2 at gmail.com Fri Sep 5 07:05:46 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Fri, 05 Sep 2014 07:05:46 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: Message-ID: Robert Kern wrote: > On Thu, Sep 4, 2014 at 12:32 PM, Neal Becker wrote: >> http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html > > What would you like to say about it? > If it is faster (and at least as good), maybe we'd like to adopt it to replace that used for mtrand -- -- Those who don't understand recursion are doomed to repeat it From robert.kern at gmail.com Fri Sep 5 07:13:33 2014 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 5 Sep 2014 12:13:33 +0100 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: On Fri, Sep 5, 2014 at 12:05 PM, Neal Becker wrote: > Robert Kern wrote: > >> On Thu, Sep 4, 2014 at 12:32 PM, Neal Becker wrote: >>> http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html >> >> What would you like to say about it? >> > > If it is faster (and at least as good), maybe we'd like to adopt it to replace > that used for mtrand It's a variant of the standard MT rather than just an implementation of it, so we can't just drop it in. You will need to build the infrastructure to support multiple PRNGs first (or rather, build the infrastructure to reuse the non-uniform distribution code with multiple core PRNGs). -- Robert Kern From ndbecker2 at gmail.com Fri Sep 5 13:19:57 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Fri, 05 Sep 2014 13:19:57 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: Message-ID: Robert Kern wrote: > On Fri, Sep 5, 2014 at 12:05 PM, Neal Becker wrote: >> Robert Kern wrote: >> >>> On Thu, Sep 4, 2014 at 12:32 PM, Neal Becker wrote: >>>> http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html >>> >>> What would you like to say about it? >>> >> >> If it is faster (and at least as good), maybe we'd like to adopt it to >> replace that used for mtrand > > It's a variant of the standard MT rather than just an implementation > of it, so we can't just drop it in. You will need to build the > infrastructure to support multiple PRNGs first (or rather, build the > infrastructure to reuse the non-uniform distribution code with > multiple core PRNGs).
> You mean it's not backward compatible because it won't generate exactly the same sequence of output for a given seed, and therefore we wouldn't want to make that change? I think it's somewhat debatable whether generating a different sequence of random numbers counts as breaking backward compatibility. -- -- Those who don't understand recursion are doomed to repeat it From robert.kern at gmail.com Fri Sep 5 13:28:14 2014 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 5 Sep 2014 18:28:14 +0100 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: On Fri, Sep 5, 2014 at 6:19 PM, Neal Becker wrote: > Robert Kern wrote: > >> On Fri, Sep 5, 2014 at 12:05 PM, Neal Becker wrote: >>> Robert Kern wrote: >>> >>>> On Thu, Sep 4, 2014 at 12:32 PM, Neal Becker wrote: >>>>> http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html >>>> >>>> What would you like to say about it? >>>> >>> >>> If it is faster (and at least as good), maybe we'd like to adopt it to >>> replace that used for mtrand >> >> It's a variant of the standard MT rather than just an implementation >> of it, so we can't just drop it in. You will need to build the >> infrastructure to support multiple PRNGs first (or rather, build the >> infrastructure to reuse the non-uniform distribution code with >> multiple core PRNGs). > > You mean it's not backward compatible because it won't generate exactly the same > sequence of output for a given seed, and therefore we wouldn't want to make that > change? > > I think it's somewhat debatable whether generating a different sequence of > random numbers counts as breaking backward compatibility. It's not a matter of debate over the semantics of the term "backwards compatibility". This is a policy that we have explicitly chosen for numpy.random (distinct from our backwards compatibility policy for the rest of numpy) because it was requested of us. -- Robert Kern From hodge at stsci.edu Fri Sep 5 13:26:29 2014 From: hodge at stsci.edu (Phil Hodge) Date: Fri, 5 Sep 2014 13:26:29 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: <5409F245.50501@stsci.edu> On 09/05/2014 01:19 PM, Neal Becker wrote: > You mean it's not backward compatible because it won't generate exactly the same > sequence of output for a given seed, and therefore we wouldn't want to make that > change? > > I think it's somewhat debatable whether generating a different sequence of > random numbers counts as breaking backward compatibility. For regression tests it's essential to be able to generate the same sequence of values from a pseudo-random number generator. Phil From jhaiduce at gmail.com Fri Sep 5 13:40:52 2014 From: jhaiduce at gmail.com (John Haiducek) Date: Fri, 5 Sep 2014 13:40:52 -0400 Subject: [Numpy-discussion] numpy.vectorize docstrings not shown by help command Message-ID: When I apply numpy.vectorize() to a function, documentation tools behave inconsistently with regard to the new, vectorized function. The function's __doc__ attribute does contain the docstring of the original function as expected, but the built-in help() command displays the documentation of the numpy.vectorize class, and sphinx-autodoc fails to display the function at all. Is there a way to get the docstring of the original function display everywhere as expected? For instance: >>> import numpy as np >>> def myfunc(x): ... "Square x" ... return x**2 ... 
>>> myfunc=np.vectorize(myfunc) >>> print myfunc.__doc__ Square x >>> help(myfunc) (displays documentation of np.vectorize) From alan.isaac at gmail.com Fri Sep 5 15:17:00 2014 From: alan.isaac at gmail.com (Alan G Isaac) Date: Fri, 05 Sep 2014 15:17:00 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: <540A0C2C.8030806@gmail.com> On 9/5/2014 1:19 PM, Neal Becker wrote: > I think it's somewhat debatable whether generating a different sequence of > random numbers counts as breaking backward compatibility. Please: it does. Alan Isaac From sturla.molden at gmail.com Fri Sep 5 16:36:52 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Fri, 05 Sep 2014 22:36:52 +0200 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: On 05/09/14 19:19, Neal Becker wrote: >> It's a variant of the standard MT rather than just an implementation >> of it, so we can't just drop it in. You will need to build the >> infrastructure to support multiple PRNGs first (or rather, build the >> infrastructure to reuse the non-uniform distribution code with >> multiple core PRNGs). >> > > You mean it's not backward compatible because it won't generate exactly the same > sequence of output for a given seed, and therefore we wouldn't want to make that > change? I thought he meant that the standard MT and SFMT have global states, whereas NumPy's randomkit MT does not. Because of this, NumPy allows you to use multiple RandomState instances. With a vanilla MT or SFMT there would be just one. Sturla From robert.kern at gmail.com Fri Sep 5 18:42:09 2014 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 5 Sep 2014 23:42:09 +0100 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: On Fri, Sep 5, 2014 at 9:36 PM, Sturla Molden wrote: > On 05/09/14 19:19, Neal Becker wrote: > >> It's a variant of the standard MT rather than just an implementation > >> of it, so we can't just drop it in. You will need to build the > >> infrastructure to support multiple PRNGs first (or rather, build the > >> infrastructure to reuse the non-uniform distribution code with > >> multiple core PRNGs). > > > > You mean it's not backward compatible because it won't generate > exactly the same > > sequence of output for a given seed, and therefore we wouldn't want > to make that > > change? > > I thought he meant that the standard MT and SFMT have global states, > whereas NumPy's randomkit MT does not. Because of this, NumPy allows you > to use multiple RandomState instances. With a vanilla MT or SFMT there > would be just one. No, that is not what I meant. If the SFMT can be made to output the same bitstream for the same seed, we can use it (modifying it if necessary to avoid global state if necessary), but it does not look so to me. I welcome corrections on that count (in PR form, preferably!). -- Robert Kern From sturla.molden at gmail.com Fri Sep 5 19:50:14 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Fri, 5 Sep 2014 23:50:14 +0000 (UTC) Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: Message-ID: <664646241431653467.005819sturla.molden-gmail.com@news.gmane.org> Robert Kern wrote: > No, that is not what I meant. If the SFMT can be made to output the > same bitstream for the same seed, we can use it (modifying it if > necessary to avoid global state if necessary), but it does not look so > to me. I welcome corrections on that count (in PR form, preferably!). 
Saito's SFMT master thesis says SFMT has a different equidistribution, so it will not give the same bitstream. But it also mentions a SIMD version of the standard MT. From jbednar at inf.ed.ac.uk Sat Sep 6 13:37:32 2014 From: jbednar at inf.ed.ac.uk (James A. Bednar) Date: Sat, 6 Sep 2014 18:37:32 +0100 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: Message-ID: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> | Date: Fri, 05 Sep 2014 13:19:57 -0400 | From: Neal Becker | | I think it's somewhat debatable whether generating a different | sequence of random numbers counts as breaking backward | compatibility. Please don't ever, ever break the sequence of numpy's random numbers! Please! We have put a lot of effort into being able to reproduce our published work exactly, and all of that would be in vain if the sequence changes. See e.g.: http://journal.frontiersin.org/Journal/10.3389/fninf.2013.00044/full We'd be very happy to see additional number generators appear *alongside* the existing ones, though, particularly if there are faster or otherwise better ones! Jim Bednar -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From jtaylor.debian at googlemail.com Sun Sep 7 06:33:19 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sun, 07 Sep 2014 12:33:19 +0200 Subject: [Numpy-discussion] ANN: NumPy 1.9.0 release Message-ID: <540C346F.8070109@googlemail.com> Hello, We are proud to announce the 1.9.0 release of NumPy. This release includes numerous performance improvements, most significantly the indexing code has been rewritten to be a lot faster for most cases, and performance of using small arrays and scalars has almost doubled. Plenty of other functions have been improved too: nonzero, where, bincount, searchsorted, count_nonzero, floating point min/max, boolean argmin/argmax, triu/tril, and masked sorting can be expected to perform significantly better in many cases. Also NumPy 1.9.0 releases the GIL for more functions, most notably indexing now releases it and the random module's state object has a private lock instead of using the GIL. This allows leveraging pure python threads more efficiently. In order to make working with arrays containing NaN values easier, nanmedian and nanpercentile have been added which ignore these values. These functions and the regular median and percentile now also support generalized axis arguments that ufuncs already have; these allow reducing along multiple axes in one call. Please see the release notes for all the details. Please also take note of the many small compatibility changes and deprecations in the notes. https://github.com/numpy/numpy/blob/maintenance/1.9.x/doc/release/1.9.0-notes.rst The source tarballs and win32 binaries can be downloaded here: https://sourceforge.net/projects/numpy/files/NumPy/1.9.0 Cheers, The NumPy Development Team -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From sebastian at sipsolutions.net Sun Sep 7 08:53:07 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Sun, 07 Sep 2014 14:53:07 +0200 Subject: [Numpy-discussion] ANN: NumPy 1.9.0 release In-Reply-To: <540C346F.8070109@googlemail.com> References: <540C346F.8070109@googlemail.com> Message-ID: <1410094387.14955.0.camel@sebastian-t440> On So, 2014-09-07 at 12:33 +0200, Julian Taylor wrote: > Hello, > > We are proud to announce the 1.9.0 release of NumPy. > Awesome, thanks for the release management! - Sebastian > This release includes numerous performance improvements, most > significantly the indexing code has been rewritten to be a lot > faster for most cases, and performance of using small arrays and scalars > has almost doubled. > Plenty of other functions have been improved too: nonzero, where, > bincount, searchsorted, count_nonzero, floating point min/max, boolean > argmin/argmax, triu/tril, and masked sorting can be expected to perform > significantly better in many cases. > > Also NumPy 1.9.0 releases the GIL for more functions, most notably > indexing now releases it and the random module's state object has a > private lock instead of using the GIL. This allows leveraging pure > python threads more efficiently. > > In order to make working with arrays containing NaN values easier, > nanmedian and nanpercentile have been added which ignore these values. > These functions and the regular median and percentile now also support > generalized axis arguments that ufuncs already have; these allow > reducing along multiple axes in one call. > > Please see the release notes for all the details. Please also take note > of the many small compatibility changes and deprecations in the notes. > https://github.com/numpy/numpy/blob/maintenance/1.9.x/doc/release/1.9.0-notes.rst > > The source tarballs and win32 binaries can be downloaded here: > https://sourceforge.net/projects/numpy/files/NumPy/1.9.0 > > Cheers, > The NumPy Development Team > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From charlesr.harris at gmail.com Sun Sep 7 10:07:27 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 7 Sep 2014 08:07:27 -0600 Subject: [Numpy-discussion] ANN: NumPy 1.9.0 release In-Reply-To: <540C346F.8070109@googlemail.com> References: <540C346F.8070109@googlemail.com> Message-ID: On Sun, Sep 7, 2014 at 4:33 AM, Julian Taylor wrote: > Hello, > > We are proud to announce the 1.9.0 release of NumPy. > > This release includes numerous performance improvements, most > significantly the indexing code has been rewritten to be a lot > faster for most cases, and performance of using small arrays and scalars > has almost doubled. > Plenty of other functions have been improved too: nonzero, where, > bincount, searchsorted, count_nonzero, floating point min/max, boolean > argmin/argmax, triu/tril, and masked sorting can be expected to perform > significantly better in many cases. > > Also NumPy 1.9.0 releases the GIL for more functions, most notably > indexing now releases it and the random module's state object has a > private lock instead of using the GIL.
This allows leveraging pure > python threads more efficiently. > > In order to make working with arrays containing NaN values easier, > nanmedian and nanpercentile have been added which ignore these values. > These functions and the regular median and percentile now also support > generalized axis arguments that ufuncs already have; these allow > reducing along multiple axes in one call. > > Please see the release notes for all the details. Please also take note > of the many small compatibility changes and deprecations in the notes. > > https://github.com/numpy/numpy/blob/maintenance/1.9.x/doc/release/1.9.0-notes.rst > > The source tarballs and win32 binaries can be downloaded here: > https://sourceforge.net/projects/numpy/files/NumPy/1.9.0 > > Cheers, > The NumPy Development Team > Great! Thanks for all the work you did getting this out. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From rays at blue-cove.com Sun Sep 7 11:23:41 2014 From: rays at blue-cove.com (RayS) Date: Sun, 07 Sep 2014 08:23:41 -0700 Subject: [Numpy-discussion] numpy whois limit exceeded Message-ID: <201409071523.s87FNfoo024957@blue-cove.com> In looking up module info for company code policy, I noticed the page http://www.networksolutions.com/whois/results.jsp?domain=numpy.org gives "WHOIS LIMIT EXCEEDED - SEE WWW.PIR.ORG/WHOIS FOR DETAILS" So the domain has been getting a lot of attention today: http://pir.org/resources/faq/ "Public Interest Registry accredited registrars who submit queries through the web-based WHOIS search mechanism are limited to 50 queries per minute. " (Oddly, WWW.PIR.ORG/WHOIS is a bad link, and pir.org's page: http://pir.org/domains/org-domain/ is also non-functional, for this Firefox at least, so they are Bcc'd) -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Sun Sep 7 15:51:57 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Sun, 7 Sep 2014 19:51:57 +0000 (UTC) Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> Message-ID: <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> "James A. Bednar" wrote: > Please don't ever, ever break the sequence of numpy's random numbers! > Please! We have put a lot of effort into being able to reproduce our > published work exactly, Jup, it cannot be overstated how important this is for reproducibility of published research. Thus from a scientific standpoint it is important that random numbers are not random. Some might think that it's just important that they are as "random as possible", but reproducibility is just as essential to stochastic simulations. This is also why parallel random number generators and parallel stochastic algorithms are so hard to program, because the operating systems' scheduler can easily break the reproducibility. I think we could add new generators to NumPy though, perhaps with a keyword to control the algorithm (defaulting to the current Mersenne Twister). A particular candidate I think we should consider is the DCMT, which is exceptionally good for parallel algorithms (the DCMT code is now BSD licensed, it used to be LGPL). Because of the way randomkit is written, it is very easy to plug in different generators.
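For illustration, a rough sketch of the kind of per-stream reproducibility at stake, using nothing but the current API (the seed values here are arbitrary, and naive seeding does not give the independence guarantees the DCMT is designed for):

import numpy as np

# One independently seeded stream per worker: which thread runs first no
# longer matters, only which seed each stream was given.
streams = [np.random.RandomState(seed) for seed in (42, 43, 44)]
draws = [rng.standard_normal(5) for rng in streams]

# Re-creating the streams with the same seeds reproduces the draws exactly.
replay = [np.random.RandomState(seed).standard_normal(5)
          for seed in (42, 43, 44)]
assert all(np.array_equal(a, b) for a, b in zip(draws, replay))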
Sturla From ben.root at ou.edu Sun Sep 7 15:59:23 2014 From: ben.root at ou.edu (Benjamin Root) Date: Sun, 7 Sep 2014 15:59:23 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> Message-ID: In addition to issues with reproducibility, think of all of the unit tests that would break! On Sun, Sep 7, 2014 at 3:51 PM, Sturla Molden wrote: > "James A. Bednar" wrote: > > > Please don't ever, ever break the sequence of numpy's random numbers! > > Please! We have put a lot of effort into being able to reproduce our > > published work exactly, > > Jup, it cannot be understated how important this is for reproducibility of > published research. Thus from a scientific standpoint it is important that > random numbers are not random. Some might think that it's just important > that they are as "random as possible", but reproducibility is just as > essential to stochastic simulations. This is also why parallel random > number generators and parallel stochastic algorithms are so hard to > program, because the operating systems' scheduler can easily break the > reproducibility. I think we could add new generators to NumPy though, > perhaps with a keyword to control the algorithm (defaulting to the current > Mersenne Twister). A particular candidate I think we should consider is the > DCMT, which is exceptionally good for parallel algorithms (the DCMT code is > now BSD licensed, it used to be LGPL). Because of the way randomkit it > written, it is very easy to plug-in different generators. > > Sturla > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Sun Sep 7 16:23:41 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Sun, 7 Sep 2014 20:23:41 +0000 (UTC) Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> Message-ID: <666709792431814167.348075sturla.molden-gmail.com@news.gmane.org> Benjamin Root wrote: > In addition to issues with reproducibility, think of all of the unit tests > that would break! That is a reproducibility problem :) From stefan.otte at gmail.com Mon Sep 8 09:29:08 2014 From: stefan.otte at gmail.com (Stefan Otte) Date: Mon, 8 Sep 2014 15:29:08 +0200 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab Message-ID: Hey, quite often I work with block matrices. Matlab offers the convenient notation [ a b; c d ] to stack matrices. The numpy equivalent is kinda clumsy: vstack([hstack([a,b]), hstack([c,d])]) I wrote the little function `stack` that does exactly that: stack([[a, b], [c, d]]) In my case `stack` replaced `hstack` and `vstack` almost completely. If you're interested in including it in numpy I created a pull request [1]. I'm looking forward to getting some feedback! 
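For illustration, a minimal sketch of the intended semantics (stack_sketch is just an illustrative stand-in, not the code in the PR; the block shapes are assumed to be compatible):

import numpy as np

def stack_sketch(blocks):
    # Each inner list is a row of blocks joined horizontally;
    # the resulting rows are then joined vertically.
    return np.vstack([np.hstack(row) for row in blocks])

a = np.ones((2, 2)); b = np.zeros((2, 3))
c = np.zeros((1, 2)); d = np.ones((1, 3))
m = stack_sketch([[a, b], [c, d]])  # a 3x5 block matrix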
Best, Stefan [1] https://github.com/numpy/numpy/pull/5057 From josef.pktd at gmail.com Mon Sep 8 10:00:47 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 8 Sep 2014 10:00:47 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: <666709792431814167.348075sturla.molden-gmail.com@news.gmane.org> References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <666709792431814167.348075sturla.molden-gmail.com@news.gmane.org> Message-ID: On Sun, Sep 7, 2014 at 4:23 PM, Sturla Molden wrote: > Benjamin Root wrote: > > In addition to issues with reproducibility, think of all of the unit > tests > > that would break! > > That is a reproducibility problem :) > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > Related aside about reproducibility of random numbers: IMO: scipy.stats.distributions.rvs does not yet guarantee the values of the random numbers except for those that are directly produced by numpy. In contrast to numpy.random, scipy's distributions don't have unit tests for the specific values of the rvs, and the rvs code for specific distributions could still be improved in some cases, I guess. Josef -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Mon Sep 8 10:41:57 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Mon, 8 Sep 2014 14:41:57 +0000 (UTC) Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab References: Message-ID: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Stefan Otte wrote: > stack([[a, b], [c, d]]) > > In my case `stack` replaced `hstack` and `vstack` almost completely. > > If you're interested in including it in numpy I created a pull request > [1]. I'm looking forward to getting some feedback! As far as I can see, it uses hstack and vstack. But that means a and b have to have the same number of rows, c and d must have the same number of rows, and hstack((a,b)) and hstack((c,d)) must have the same number of columns. Thus it requires a regularity like this: AAAABB AAAABB CCCDDD CCCDDD CCCDDD CCCDDD What if we just ignore this constraint, and only require the output to be rectangular? Now we have a 'tetris game': AAAABB AAAABB CCCCBB CCCCBB CCCCDD CCCCDD or AAAABB AAAABB CCCCBB CCCCBB CCCCBB CCCCBB This should be 'stackable', yes? Or perhaps we need another stacking function for this, say numpy.tetris? And while we're at it, what about higher dimensions? should there be an ndstack function too? Sturla From josef.pktd at gmail.com Mon Sep 8 12:08:12 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 8 Sep 2014 12:08:12 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: On Mon, Sep 8, 2014 at 10:41 AM, Sturla Molden wrote: > Stefan Otte wrote: > > > stack([[a, b], [c, d]]) > > > > In my case `stack` replaced `hstack` and `vstack` almost completely. > > > > If you're interested in including it in numpy I created a pull request > > [1]. I'm looking forward to getting some feedback! >
But that means a and b have > to have the same number of rows, c and d must have the same rumber of rows, > and hstack((a,b)) and hstack((c,d)) must have the same number of columns. > > Thus it requires a regularity like this: > > AAAABB > AAAABB > CCCDDD > CCCDDD > CCCDDD > CCCDDD > > What if we just ignore this constraint, and only require the output to be > rectangular? Now we have a 'tetris game': > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCDD > CCCCDD > > or > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCBB > CCCCBB > > This should be 'stackable', yes? Or perhaps we need another stacking > function for this, say numpy.tetris? > > And while we're at it, what about higher dimensions? should there be an > ndstack function too? > > > Sturla > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaime.frio at gmail.com Mon Sep 8 12:10:35 2014 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Mon, 8 Sep 2014 09:10:35 -0700 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden wrote: > Stefan Otte wrote: > > > stack([[a, b], [c, d]]) > > > > In my case `stack` replaced `hstack` and `vstack` almost completely. > > > > If you're interested in including it in numpy I created a pull request > > [1]. I'm looking forward to getting some feedback! > > As far as I can see, it uses hstack and vstack. But that means a and b have > to have the same number of rows, c and d must have the same rumber of rows, > and hstack((a,b)) and hstack((c,d)) must have the same number of columns. > > Thus it requires a regularity like this: > > AAAABB > AAAABB > CCCDDD > CCCDDD > CCCDDD > CCCDDD > > What if we just ignore this constraint, and only require the output to be > rectangular? Now we have a 'tetris game': > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCDD > CCCCDD > > or > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCBB > CCCCBB > > This should be 'stackable', yes? Or perhaps we need another stacking > function for this, say numpy.tetris? > > And while we're at it, what about higher dimensions? should there be an > ndstack function too? > This is starting to look like the second time in a row Stefan tries to extend numpy with a simple convenience function, and he gets tricked into implementing some sophisticated algorithm... For his next PR I expect nothing less than an NP-complete problem. ;-) > Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Mon Sep 8 12:13:46 2014 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 8 Sep 2014 12:13:46 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: On 8 Sep 2014 10:42, "Sturla Molden" wrote: > > Stefan Otte wrote: > > > stack([[a, b], [c, d]]) > > > > In my case `stack` replaced `hstack` and `vstack` almost completely. > > > > If you're interested in including it in numpy I created a pull request > > [1]. I'm looking forward to getting some feedback! > > As far as I can see, it uses hstack and vstack. But that means a and b have > to have the same number of rows, c and d must have the same number of rows, > and hstack((a,b)) and hstack((c,d)) must have the same number of columns. > > Thus it requires a regularity like this: > > AAAABB > AAAABB > CCCDDD > CCCDDD > CCCDDD > CCCDDD > > What if we just ignore this constraint, and only require the output to be > rectangular? Now we have a 'tetris game': > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCDD > CCCCDD > > or > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCBB > CCCCBB > > This should be 'stackable', yes? Or perhaps we need another stacking > function for this, say numpy.tetris? It's not at all obvious to me how to describe such "tetris" configurations, or interpret them unambiguously. Do you have a more detailed specification in mind? > And while we're at it, what about higher dimensions? should there be an > ndstack function too? Same comment here. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From t.b.poole at gmail.com Mon Sep 8 12:14:22 2014 From: t.b.poole at gmail.com (tpoole) Date: Mon, 8 Sep 2014 09:14:22 -0700 (MST) Subject: [Numpy-discussion] Weighted Covariance/correlation In-Reply-To: <1408110393.20638.16.camel@sebastian-t440> References: <1408110393.20638.16.camel@sebastian-t440> Message-ID: <1410192862477-38570.post@n7.nabble.com> Hi all, Any input to this? Last time it generated a fair bit of discussion, which I'll summarise here. It's currently possible to calculate a weighted average using np.average, but the corresponding functionality does not exist for (co)variance or corrcoef calculations. In this case it's less straightforward, and we need to worry about what type of information the weights contain. Repeat type weights are the easiest to explain. Here the variances of [x1, x2, x3] with weights [2, 1, 3] and [x1, x1, x2, x3, x3, x3] are identical. For Bessel correction the total number of samples is obtained by summing the weights. These weights do not have to be integer, and in this case the only important assumption is that their sum represents the total sample size. The second type of weights are importances or accuracies. Here the weights represent the relative strength of contributions from each of the associated samples. Because this is a purely relative relation, there's no concrete information about the total number of samples. This has to be obtained from the effective sample size, given by (sum(weights)^2)/sum(weights^2). I think the clearest way of providing both options is to have a boolean switch indicating if the weights represent repeat (frequency) type information. I can't immediately see a good motivation for allowing both concurrently, and think this could cause confusion.
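As a rough sketch of the difference (weighted_var is a hypothetical helper, not the API in the PR; the Bessel correction uses the sample count implied by each interpretation):

import numpy as np

def weighted_var(x, w, repeat_weights):
    # Hypothetical helper illustrating the two weight interpretations.
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    mean = np.average(x, weights=w)
    biased = np.average((x - mean)**2, weights=w)
    if repeat_weights:
        n = w.sum()                      # weights directly count samples
    else:
        n = w.sum()**2 / (w**2).sum()    # effective sample size
    return biased * n / (n - 1.0)        # Bessel-style correction

# With repeat weights, [2, 1, 3] matches the explicitly repeated data:
assert np.isclose(weighted_var([1., 2., 3.], [2, 1, 3], repeat_weights=True),
                  np.var([1., 1., 2., 3., 3., 3.], ddof=1))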
Tom On 15 Aug 2014, at 14:46, Sebastian Berg <[hidden email]> wrote: > Hi all, > > Tom Poole has opened pull request > https://github.com/numpy/numpy/pull/4960 to implement weights into > np.cov (correlation can be added), somewhat picking up the effort > started by Noel Dawe in https://github.com/numpy/numpy/pull/3864. > > The pull request would currently implement an accuracy type `weights` > keyword argument as default, but have a switch `repeat_weights` to use > repeat type weights instead (frequency type are a special case of this I > think). > > As far as I can see, the code is in a state that it can be tested. But > since it is a new feature, the names/defaults are up for discussion, so > maybe someone who might use such a feature has a preference. I know we > had a short discussion about this before, but it was a while ago. For > example another option would be to have the two weights as two keyword > arguments, instead of a boolean switch. > > Regards, > > Sebastian > > _______________________________________________ > NumPy-Discussion mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/numpy-discussion -- View this message in context: http://numpy-discussion.10968.n7.nabble.com/Weighted-Covariance-correlation-tp38394p38570.html Sent from the Numpy-discussion mailing list archive at Nabble.com. From josef.pktd at gmail.com Mon Sep 8 12:18:25 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 8 Sep 2014 12:18:25 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: On Mon, Sep 8, 2014 at 12:10 PM, Jaime Fern?ndez del R?o < jaime.frio at gmail.com> wrote: > On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden > wrote: > >> Stefan Otte wrote: >> >> > stack([[a, b], [c, d]]) >> > >> > In my case `stack` replaced `hstack` and `vstack` almost completely. >> > >> > If you're interested in including it in numpy I created a pull request >> > [1]. I'm looking forward to getting some feedback! >> >> As far as I can see, it uses hstack and vstack. But that means a and b >> have >> to have the same number of rows, c and d must have the same rumber of >> rows, >> and hstack((a,b)) and hstack((c,d)) must have the same number of columns. >> >> Thus it requires a regularity like this: >> >> AAAABB >> AAAABB >> CCCDDD >> CCCDDD >> CCCDDD >> CCCDDD >> >> What if we just ignore this constraint, and only require the output to be >> rectangular? Now we have a 'tetris game': >> >> AAAABB >> AAAABB >> CCCCBB >> CCCCBB >> CCCCDD >> CCCCDD >> >> or >> >> AAAABB >> AAAABB >> CCCCBB >> CCCCBB >> CCCCBB >> CCCCBB >> >> This should be 'stackable', yes? Or perhaps we need another stacking >> function for this, say numpy.tetris? >> >> And while we're at it, what about higher dimensions? should there be an >> ndstack function too? >> > > This is starting to look like the second time in a row Stefan tries to > extend numpy with a simple convenience function, and he gets tricked into > implementing some sophisticated algorithm... > Maybe the third time is a gem. > > For his next PR I expect nothing less than an NP-complete problem. ;-) > How about cholesky or qr updating? I could use one right now. Josef > > >> Jaime > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes > de dominaci?n mundial. 
> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoogendoorn.eelco at gmail.com Mon Sep 8 12:41:58 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Mon, 8 Sep 2014 18:41:58 +0200 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: Sturla: im not sure if the intention is always unambiguous, for such more flexible arrangements. Also, I doubt such situations arise often in practice; if the arrays arnt a grid, they are probably a nested grid, and the code would most naturally concatenate them with nested calls to a stacking function. However, some form of nd-stack function would be neat in my opinion. On Mon, Sep 8, 2014 at 6:10 PM, Jaime Fern?ndez del R?o < jaime.frio at gmail.com> wrote: > On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden > wrote: > >> Stefan Otte wrote: >> >> > stack([[a, b], [c, d]]) >> > >> > In my case `stack` replaced `hstack` and `vstack` almost completely. >> > >> > If you're interested in including it in numpy I created a pull request >> > [1]. I'm looking forward to getting some feedback! >> >> As far as I can see, it uses hstack and vstack. But that means a and b >> have >> to have the same number of rows, c and d must have the same rumber of >> rows, >> and hstack((a,b)) and hstack((c,d)) must have the same number of columns. >> >> Thus it requires a regularity like this: >> >> AAAABB >> AAAABB >> CCCDDD >> CCCDDD >> CCCDDD >> CCCDDD >> >> What if we just ignore this constraint, and only require the output to be >> rectangular? Now we have a 'tetris game': >> >> AAAABB >> AAAABB >> CCCCBB >> CCCCBB >> CCCCDD >> CCCCDD >> >> or >> >> AAAABB >> AAAABB >> CCCCBB >> CCCCBB >> CCCCBB >> CCCCBB >> >> This should be 'stackable', yes? Or perhaps we need another stacking >> function for this, say numpy.tetris? >> >> And while we're at it, what about higher dimensions? should there be an >> ndstack function too? >> > > This is starting to look like the second time in a row Stefan tries to > extend numpy with a simple convenience function, and he gets tricked into > implementing some sophisticated algorithm... > > For his next PR I expect nothing less than an NP-complete problem. ;-) > > >> Jaime > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes > de dominaci?n mundial. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Mon Sep 8 12:55:34 2014 From: ben.root at ou.edu (Benjamin Root) Date: Mon, 8 Sep 2014 12:55:34 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: A use case would be "image stitching" or even data tiling. I have had to implement something like this at work (so, I can't share it, unfortunately) and it even goes so far as to allow the caller to specify how much the tiles can overlap and such. 
The specification is ungodly hideous and I doubt I would be willing to share it even if I could lest I release code-thulu upon the world... I think just having this generalize stack feature would be nice start. Tetris could be built on top of that later. (Although, I do vote for at least 3 or 4 dimensional stacking, if possible). Cheers! Ben Root On Mon, Sep 8, 2014 at 12:41 PM, Eelco Hoogendoorn < hoogendoorn.eelco at gmail.com> wrote: > Sturla: im not sure if the intention is always unambiguous, for such more > flexible arrangements. > > Also, I doubt such situations arise often in practice; if the arrays arnt > a grid, they are probably a nested grid, and the code would most naturally > concatenate them with nested calls to a stacking function. > > However, some form of nd-stack function would be neat in my opinion. > > On Mon, Sep 8, 2014 at 6:10 PM, Jaime Fern?ndez del R?o < > jaime.frio at gmail.com> wrote: > >> On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden >> wrote: >> >>> Stefan Otte wrote: >>> >>> > stack([[a, b], [c, d]]) >>> > >>> > In my case `stack` replaced `hstack` and `vstack` almost completely. >>> > >>> > If you're interested in including it in numpy I created a pull request >>> > [1]. I'm looking forward to getting some feedback! >>> >>> As far as I can see, it uses hstack and vstack. But that means a and b >>> have >>> to have the same number of rows, c and d must have the same rumber of >>> rows, >>> and hstack((a,b)) and hstack((c,d)) must have the same number of columns. >>> >>> Thus it requires a regularity like this: >>> >>> AAAABB >>> AAAABB >>> CCCDDD >>> CCCDDD >>> CCCDDD >>> CCCDDD >>> >>> What if we just ignore this constraint, and only require the output to be >>> rectangular? Now we have a 'tetris game': >>> >>> AAAABB >>> AAAABB >>> CCCCBB >>> CCCCBB >>> CCCCDD >>> CCCCDD >>> >>> or >>> >>> AAAABB >>> AAAABB >>> CCCCBB >>> CCCCBB >>> CCCCBB >>> CCCCBB >>> >>> This should be 'stackable', yes? Or perhaps we need another stacking >>> function for this, say numpy.tetris? >>> >>> And while we're at it, what about higher dimensions? should there be an >>> ndstack function too? >>> >> >> This is starting to look like the second time in a row Stefan tries to >> extend numpy with a simple convenience function, and he gets tricked into >> implementing some sophisticated algorithm... >> >> For his next PR I expect nothing less than an NP-complete problem. ;-) >> >> >>> Jaime >> >> -- >> (\__/) >> ( O.o) >> ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes >> de dominaci?n mundial. >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ben.root at ou.edu Mon Sep 8 13:00:11 2014 From: ben.root at ou.edu (Benjamin Root) Date: Mon, 8 Sep 2014 13:00:11 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: Btw, on a somewhat related note, whoever can implement ndarray to be able to use views from other ndarrays stitched together would get a fruit basket from me come the holidays and possibly naming rights for the next kid... Cheers! Ben Root On Mon, Sep 8, 2014 at 12:55 PM, Benjamin Root wrote: > A use case would be "image stitching" or even data tiling. I have had to > implement something like this at work (so, I can't share it, unfortunately) > and it even goes so far as to allow the caller to specify how much the > tiles can overlap and such. The specification is ungodly hideous and I > doubt I would be willing to share it even if I could lest I release > code-thulu upon the world... > > I think just having this generalize stack feature would be nice start. > Tetris could be built on top of that later. (Although, I do vote for at > least 3 or 4 dimensional stacking, if possible). > > Cheers! > Ben Root > > > On Mon, Sep 8, 2014 at 12:41 PM, Eelco Hoogendoorn < > hoogendoorn.eelco at gmail.com> wrote: > >> Sturla: im not sure if the intention is always unambiguous, for such more >> flexible arrangements. >> >> Also, I doubt such situations arise often in practice; if the arrays arnt >> a grid, they are probably a nested grid, and the code would most naturally >> concatenate them with nested calls to a stacking function. >> >> However, some form of nd-stack function would be neat in my opinion. >> >> On Mon, Sep 8, 2014 at 6:10 PM, Jaime Fern?ndez del R?o < >> jaime.frio at gmail.com> wrote: >> >>> On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden >>> wrote: >>> >>>> Stefan Otte wrote: >>>> >>>> > stack([[a, b], [c, d]]) >>>> > >>>> > In my case `stack` replaced `hstack` and `vstack` almost completely. >>>> > >>>> > If you're interested in including it in numpy I created a pull request >>>> > [1]. I'm looking forward to getting some feedback! >>>> >>>> As far as I can see, it uses hstack and vstack. But that means a and b >>>> have >>>> to have the same number of rows, c and d must have the same rumber of >>>> rows, >>>> and hstack((a,b)) and hstack((c,d)) must have the same number of >>>> columns. >>>> >>>> Thus it requires a regularity like this: >>>> >>>> AAAABB >>>> AAAABB >>>> CCCDDD >>>> CCCDDD >>>> CCCDDD >>>> CCCDDD >>>> >>>> What if we just ignore this constraint, and only require the output to >>>> be >>>> rectangular? Now we have a 'tetris game': >>>> >>>> AAAABB >>>> AAAABB >>>> CCCCBB >>>> CCCCBB >>>> CCCCDD >>>> CCCCDD >>>> >>>> or >>>> >>>> AAAABB >>>> AAAABB >>>> CCCCBB >>>> CCCCBB >>>> CCCCBB >>>> CCCCBB >>>> >>>> This should be 'stackable', yes? Or perhaps we need another stacking >>>> function for this, say numpy.tetris? >>>> >>>> And while we're at it, what about higher dimensions? should there be an >>>> ndstack function too? >>>> >>> >>> This is starting to look like the second time in a row Stefan tries to >>> extend numpy with a simple convenience function, and he gets tricked into >>> implementing some sophisticated algorithm... >>> >>> For his next PR I expect nothing less than an NP-complete problem. ;-) >>> >>> >>>> Jaime >>> >>> -- >>> (\__/) >>> ( O.o) >>> ( > <) Este es Conejo. 
Copia a Conejo en tu firma y ayúdale en sus >>> planes de dominación mundial. >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noel.pierre.andre at gmail.com Mon Sep 8 13:05:03 2014 From: noel.pierre.andre at gmail.com (Pierre-Andre Noel) Date: Mon, 08 Sep 2014 10:05:03 -0700 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> Message-ID: <540DE1BF.4030405@gmail.com> > I think we could add new generators to NumPy though, > perhaps with a keyword to control the algorithm (defaulting to the current > Mersenne Twister). Why not do something like the C++11 <random>? In <random>, a "generator" is the engine producing randomness, and a "distribution" decides what is the type of outputs that you want. Here is the example on http://www.cplusplus.com/reference/random/ . std::default_random_engine generator; std::uniform_int_distribution<int> distribution(1,6); int dice_roll = distribution(generator); // generates number in the range 1..6 For convenience, you can bind the generator with the distribution (still from the web page above). auto dice = std::bind(distribution, generator); int wisdom = dice()+dice()+dice(); Here is how I propose to adapt this scheme to numpy. First, there would be a global generator defaulting to the current implementation of Mersenne Twister. Calls to numpy's "RandomState", "seed", "get_state" and "set_state" would affect this global generator. All numpy functions associated with random number generation (i.e., everything listed on http://docs.scipy.org/doc/numpy/reference/routines.random.html except for "RandomState", "seed", "get_state" and "set_state") would accept the kwarg "generator", which defaults to the global generator (by default the current Mersenne Twister). Now there could be other generator objects: the new Mersenne Twister, some lightweight-but-worse generator, or some cryptographically-safe random generator. Each such generator would have "RandomState", "seed", "get_state" and "set_state" methods (except perhaps the cryptographically-safe ones). When calling a numpy function with generator=my_generator, that function uses this generator instead of the global one. Moreover, there would be a function, say select_default_random_engine(generator), which changes the global generator to a user-specified one. > This is also why parallel random > number generators and parallel stochastic algorithms are so hard to > program, because the operating systems' scheduler can easily break the > reproducibility. What I propose would also simplify this: each thread can use its own independently-seeded generator. Timing is no longer a problem: as long as which-thread-does-what is not affected by the scheduler, the execution remains deterministic. On 09/07/2014 12:51 PM, Sturla Molden wrote: > "James A. Bednar" wrote: > >> Please don't ever, ever break the sequence of numpy's random numbers! >> Please!
We have put a lot of effort into being able to reproduce our >> published work exactly, > Jup, it cannot be understated how important this is for reproducibility of > published research. Thus from a scientific standpoint it is important that > random numbers are not random. Some might think that it's just important > that they are as "random as possible", but reproducibility is just as > essential to stochastic simulations. This is also why parallel random > number generators and parallel stochastic algorithms are so hard to > program, because the operating systems' scheduler can easily break the > reproducibility. I think we could add new generators to NumPy though, > perhaps with a keyword to control the algorithm (defaulting to the current > Mersenne Twister). A particular candidate I think we should consider is the > DCMT, which is exceptionally good for parallel algorithms (the DCMT code is > now BSD licensed, it used to be LGPL). Because of the way randomkit it > written, it is very easy to plug-in different generators. > > Sturla > > From shoyer at gmail.com Mon Sep 8 13:18:45 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 8 Sep 2014 10:18:45 -0700 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: On Mon, Sep 8, 2014 at 10:00 AM, Benjamin Root wrote: > Btw, on a somewhat related note, whoever can implement ndarray to be able > to use views from other ndarrays stitched together would get a fruit basket > from me come the holidays and possibly naming rights for the next kid... > Ben, you should check out Biggus, which does (at least some) of what you're describing: https://github.com/SciTools/biggus Two things I would like to see before this makes it into numpy: (1) It should handle arbitrary dimensional arrays, not just 2D. (2) It needs to be more efficient. Composing vstack and hstack directly makes a whole level of unnecessary intermediate copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoogendoorn.eelco at gmail.com Mon Sep 8 13:37:31 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Mon, 8 Sep 2014 19:37:31 +0200 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Blockmatrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: <540de947.a50cb40a.5945.0e25@mx.google.com> Blaze aims to do something like that; to make the notion of an array and how it stores it's data far more flexible. But if it isn't a single strided ND array, it isn't numpy. This concept lies at its very heart; and for good reasons I would add. -----Original Message----- From: "Benjamin Root" Sent: ?8-?9-?2014 19:00 To: "Discussion of Numerical Python" Subject: Re: [Numpy-discussion] Generalize hstack/vstack --> stack; Blockmatrices like in matlab Btw, on a somewhat related note, whoever can implement ndarray to be able to use views from other ndarrays stitched together would get a fruit basket from me come the holidays and possibly naming rights for the next kid... Cheers! Ben Root On Mon, Sep 8, 2014 at 12:55 PM, Benjamin Root wrote: A use case would be "image stitching" or even data tiling. I have had to implement something like this at work (so, I can't share it, unfortunately) and it even goes so far as to allow the caller to specify how much the tiles can overlap and such. 
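(A toy sketch to make the overlapping-tiles idea concrete -- this is only an illustration assuming 1-D data and a fixed overlap, nothing like the real code mentioned above:)

    import numpy as np

    def tiles(a, size, overlap):
        # Toy version: yield overlapping views along the first axis.
        # These are views, so no data is copied; overlap must be < size.
        step = size - overlap
        for start in range(0, a.shape[0] - size + 1, step):
            yield a[start:start + size]

    data = np.arange(20)
    for t in tiles(data, size=8, overlap=3):
        print(t)   # [0..7], [5..12], [10..17]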
The specification is ungodly hideous and I doubt I would be willing to share it even if I could lest I release code-thulu upon the world... I think just having this generalize stack feature would be nice start. Tetris could be built on top of that later. (Although, I do vote for at least 3 or 4 dimensional stacking, if possible). Cheers! Ben Root On Mon, Sep 8, 2014 at 12:41 PM, Eelco Hoogendoorn wrote: Sturla: im not sure if the intention is always unambiguous, for such more flexible arrangements. Also, I doubt such situations arise often in practice; if the arrays arnt a grid, they are probably a nested grid, and the code would most naturally concatenate them with nested calls to a stacking function. However, some form of nd-stack function would be neat in my opinion. On Mon, Sep 8, 2014 at 6:10 PM, Jaime Fern?ndez del R?o wrote: On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden wrote: Stefan Otte wrote: > stack([[a, b], [c, d]]) > > In my case `stack` replaced `hstack` and `vstack` almost completely. > > If you're interested in including it in numpy I created a pull request > [1]. I'm looking forward to getting some feedback! As far as I can see, it uses hstack and vstack. But that means a and b have to have the same number of rows, c and d must have the same rumber of rows, and hstack((a,b)) and hstack((c,d)) must have the same number of columns. Thus it requires a regularity like this: AAAABB AAAABB CCCDDD CCCDDD CCCDDD CCCDDD What if we just ignore this constraint, and only require the output to be rectangular? Now we have a 'tetris game': AAAABB AAAABB CCCCBB CCCCBB CCCCDD CCCCDD or AAAABB AAAABB CCCCBB CCCCBB CCCCBB CCCCBB This should be 'stackable', yes? Or perhaps we need another stacking function for this, say numpy.tetris? And while we're at it, what about higher dimensions? should there be an ndstack function too? This is starting to look like the second time in a row Stefan tries to extend numpy with a simple convenience function, and he gets tricked into implementing some sophisticated algorithm... For his next PR I expect nothing less than an NP-complete problem. ;-) Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Mon Sep 8 14:00:27 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Mon, 08 Sep 2014 20:00:27 +0200 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: <540DE1BF.4030405@gmail.com> References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <540DE1BF.4030405@gmail.com> Message-ID: <540DEEBB.5090803@googlemail.com> On 08.09.2014 19:05, Pierre-Andre Noel wrote: > > I think we could add new generators to NumPy though, > > perhaps with a keyword to control the algorithm (defaulting to the > current > > Mersenne Twister). > ... > > Here is how I propose to adapt this scheme to numpy. First, there would > be a global generator defaulting to the current implementation of > Mersene Twister. 
Calls to numpy's "RandomState", "seed", "get_state" and > "set_state" would affect this global generator. > > All numpy functions associated to random number generation (i.e., > everything listed on > http://docs.scipy.org/doc/numpy/reference/routines.random.html except > for "RandomState", "seed", "get_state" and "set_state") would accept the > kwarg "generator", which defaults to the global generator (by default > the current Mersene Twister). > I don't think every function would need a generator argument, for real world applications it should be sufficient to have the state object carry which generator is used and maybe a switch to change the global one. But as already mentioned by Robert, we know what we can do, what is missing is someone writting the code. From robert.kern at gmail.com Mon Sep 8 14:43:34 2014 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 8 Sep 2014 19:43:34 +0100 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: <540DE1BF.4030405@gmail.com> References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <540DE1BF.4030405@gmail.com> Message-ID: On Mon, Sep 8, 2014 at 6:05 PM, Pierre-Andre Noel wrote: > > I think we could add new generators to NumPy though, > > perhaps with a keyword to control the algorithm (defaulting to the > current > > Mersenne Twister). > > Why not do something like the C++11 ? In , a "generator" > is the engine producing randomness, and a "distribution" decides what is > the type of outputs that you want. Here is the example on > http://www.cplusplus.com/reference/random/ . > > std::default_random_engine generator; > std::uniform_int_distribution distribution(1,6); > int dice_roll = distribution(generator); // generates number in > the range 1..6 > > For convenience, you can bind the generator with the distribution (still > from the web page above). > > auto dice = std::bind(distribution, generator); > int wisdom = dice()+dice()+dice(); > > Here is how I propose to adapt this scheme to numpy. First, there would > be a global generator defaulting to the current implementation of > Mersene Twister. Calls to numpy's "RandomState", "seed", "get_state" and > "set_state" would affect this global generator. > > All numpy functions associated to random number generation (i.e., > everything listed on > http://docs.scipy.org/doc/numpy/reference/routines.random.html except > for "RandomState", "seed", "get_state" and "set_state") would accept the > kwarg "generator", which defaults to the global generator (by default > the current Mersene Twister). > > Now there could be other generator objects: the new Mersene Twister, > some lightweight-but-worse generator, or some cryptographically-safe > random generator. Each such generator would have "RandomState", "seed", > "get_state" and "set_state" methods (except perhaps the > criptographically-safe ones). When calling a numpy function with > generator=my_generator, that function uses this generator instead the > global one. Moreover, there would be be a function, say > select_default_random_engine(generator), which changes the global > generator to a user-specified one. I think the Python standard library's example is more instructive. We have new classes for each new core uniform generator. They will share a common superclass to share the implementation of the non-uniform distributions. 
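Roughly, as a sketch (nothing below exists in numpy today; the SFMT name is the hypothetical variant under discussion):

    import numpy as np

    # Today: one engine class, with the distributions as methods.
    rs = np.random.RandomState(12345)
    x = rs.standard_normal(5)

    # Proposed shape of things, names purely illustrative:
    #
    #   sfmt = np.random.SFMT(12345)    # new core uniform generator
    #   y = sfmt.standard_normal(5)     # same distribution methods,
    #                                   # inherited from the superclass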
numpy.random.RandomState will continue to be the current Mersenne Twister implementation, and so will the underlying global RandomState for all of the convenience functions in numpy.random. If you want the SFMT variant, you instantiate numpy.random.SFMT() and call its methods directly. -- Robert Kern From joseph.martinot-lagarde at m4x.org Mon Sep 8 16:39:49 2014 From: joseph.martinot-lagarde at m4x.org (Joseph Martinot-Lagarde) Date: Mon, 08 Sep 2014 22:39:49 +0200 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: <540E1415.5040600@m4x.org> Le 08/09/2014 16:41, Sturla Molden a ?crit : > Stefan Otte wrote: > >> stack([[a, b], [c, d]]) >> >> In my case `stack` replaced `hstack` and `vstack` almost completely. >> >> If you're interested in including it in numpy I created a pull request >> [1]. I'm looking forward to getting some feedback! > > As far as I can see, it uses hstack and vstack. But that means a and b have > to have the same number of rows, c and d must have the same rumber of rows, > and hstack((a,b)) and hstack((c,d)) must have the same number of columns. > > Thus it requires a regularity like this: > > AAAABB > AAAABB > CCCDDD > CCCDDD > CCCDDD > CCCDDD > > What if we just ignore this constraint, and only require the output to be > rectangular? Now we have a 'tetris game': > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCDD > CCCCDD > > or > > AAAABB > AAAABB > CCCCBB > CCCCBB > CCCCBB > CCCCBB stack([stack([[a], [c]]), b]) > > This should be 'stackable', yes? Or perhaps we need another stacking > function for this, say numpy.tetris? The function should be implemented for its name only ! I like it ! > > And while we're at it, what about higher dimensions? should there be an > ndstack function too? > > > Sturla > From joseph.martinot-lagarde at m4x.org Mon Sep 8 16:40:39 2014 From: joseph.martinot-lagarde at m4x.org (Joseph Martinot-Lagarde) Date: Mon, 08 Sep 2014 22:40:39 +0200 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: Message-ID: <540E1447.4040009@m4x.org> Le 08/09/2014 15:29, Stefan Otte a ?crit : > Hey, > > quite often I work with block matrices. Matlab offers the convenient notation > > [ a b; c d ] > > to stack matrices. The numpy equivalent is kinda clumsy: > > vstack([hstack([a,b]), hstack([c,d])]) > > I wrote the little function `stack` that does exactly that: > > stack([[a, b], [c, d]]) > > In my case `stack` replaced `hstack` and `vstack` almost completely. > > If you're interested in including it in numpy I created a pull request > [1]. I'm looking forward to getting some feedback! > > > Best, > Stefan > > > > [1] https://github.com/numpy/numpy/pull/5057 > The outside brackets are redundant, stack([[a, b], [c, d]]) should be stack([a, b], [c, d]) From cjw at ncf.ca Mon Sep 8 18:14:24 2014 From: cjw at ncf.ca (cjw) Date: Mon, 08 Sep 2014 18:14:24 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: <540E1447.4040009@m4x.org> References: <540E1447.4040009@m4x.org> Message-ID: <540E2A40.70208@ncf.ca> On 08-Sep-14 4:40 PM, Joseph Martinot-Lagarde wrote: > Le 08/09/2014 15:29, Stefan Otte a ?crit : >> Hey, >> >> quite often I work with block matrices. 
Matlab offers the convenient notation >> >> [ a b; c d ] This would appear to be a desirable way to go. Numpy has something similar for strings. The above is neater. Colin W. >> to stack matrices. The numpy equivalent is kinda clumsy: >> >> vstack([hstack([a,b]), hstack([c,d])]) >> >> I wrote the little function `stack` that does exactly that: >> >> stack([[a, b], [c, d]]) >> >> In my case `stack` replaced `hstack` and `vstack` almost completely. >> >> If you're interested in including it in numpy I created a pull request >> [1]. I'm looking forward to getting some feedback! >> >> >> Best, >> Stefan >> >> >> >> [1] https://github.com/numpy/numpy/pull/5057 >> > The outside brackets are redundant, stack([[a, b], [c, d]]) should be > stack([a, b], [c, d]) > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From stefan.otte at gmail.com Tue Sep 9 05:42:38 2014 From: stefan.otte at gmail.com (Stefan Otte) Date: Tue, 9 Sep 2014 11:42:38 +0200 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: <540E2A40.70208@ncf.ca> References: <540E1447.4040009@m4x.org> <540E2A40.70208@ncf.ca> Message-ID: Hey, @Josef, I wasn't aware of `bmat` and `np.asarray(np.bmat(....))` does basically what I want and what I'm already using. Regarding the Tetris problem: that never happened to me, but stack, as Josef pointed out, can handle that already :) I like the idea of removing the redundant square brackets: stack([[a, b], [c, d]]) --> stack([a, b], [c, d]) However, if the brackets are there there is no difference between creating a `np.array` and stacking arrays with `np.stack`. If we want to get fancy and turn this PR into something bigger (working our way up to a NP-complete problem ;)) then how about this. I sometimes have arrays that look like: AB0 0 C Where 0 is a scalar but is supposed to fill the rest of the array. Having something like 0 in there might lead to ambiguities though. What does ABC 0D0 mean? One could limit the "filler" to appear only on the left or the right: AB0 0CD But even then the shape is not completely determined. So we could require to have one row that only consists of arrays and determines the shape. Alternatively we could have a keyword parameter `shape`: stack([A, B, 0], [0, C, D], shape=(8, 8)) Colin, with `bmat` you can do what you're asking for. Directly taken from the example: >>> np.bmat('A,B; C,D') matrix([[1, 1, 2, 2], [1, 1, 2, 2], [3, 4, 7, 8], [5, 6, 9, 0]]) General question: If `bmat` already offers something like `stack` should we even bother implementing `stack`? More code leads to more bugs and maintenance work. Best, Stefan On Tue, Sep 9, 2014 at 12:14 AM, cjw wrote: > > On 08-Sep-14 4:40 PM, Joseph Martinot-Lagarde wrote: >> Le 08/09/2014 15:29, Stefan Otte a ?crit : >>> Hey, >>> >>> quite often I work with block matrices. Matlab offers the convenient notation >>> >>> [ a b; c d ] > This would appear to be a desirable way to go. > > Numpy has something similar for strings. The above is neater. > > Colin W. >>> to stack matrices. The numpy equivalent is kinda clumsy: >>> >>> vstack([hstack([a,b]), hstack([c,d])]) >>> >>> I wrote the little function `stack` that does exactly that: >>> >>> stack([[a, b], [c, d]]) >>> >>> In my case `stack` replaced `hstack` and `vstack` almost completely. >>> >>> If you're interested in including it in numpy I created a pull request >>> [1]. 
I'm looking forward to getting some feedback! >>> >>> >>> Best, >>> Stefan >>> >>> >>> >>> [1] https://github.com/numpy/numpy/pull/5057 >>> >> The outside brackets are redundant, stack([[a, b], [c, d]]) should be >> stack([a, b], [c, d]) >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Tue Sep 9 08:30:13 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 9 Sep 2014 08:30:13 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <540E1447.4040009@m4x.org> <540E2A40.70208@ncf.ca> Message-ID: On Tue, Sep 9, 2014 at 5:42 AM, Stefan Otte wrote: > Hey, > > @Josef, I wasn't aware of `bmat` and `np.asarray(np.bmat(....))` does > basically what I want and what I'm already using. > I never needed any tetris or anything similar except for the matched block version. Just to point out two more related functions scipy.sparse also has `bmat` for sparse block matrices scipy.linalg and scipy.sparse have `block_diag` to complement bmat. What I sometimes wish for is a sparse pseudo kronecker product as convenience to bmat, where the first (or the second) matrix contains (0,1) flags where the 1's specify where to put the blocks. (I'm not sure what I really mean, similar to block_diag but with a different filling pattern.) Josef > > Regarding the Tetris problem: that never happened to me, but stack, as > Josef pointed out, can handle that already :) > > I like the idea of removing the redundant square brackets: > stack([[a, b], [c, d]]) --> stack([a, b], [c, d]) > However, if the brackets are there there is no difference between > creating a `np.array` and stacking arrays with `np.stack`. > > If we want to get fancy and turn this PR into something bigger > (working our way up to a NP-complete problem ;)) then how about this. > I sometimes have arrays that look like: > AB0 > 0 C > Where 0 is a scalar but is supposed to fill the rest of the array. > Having something like 0 in there might lead to ambiguities though. What > does > ABC > 0D0 > mean? One could limit the "filler" to appear only on the left or the right: > AB0 > 0CD > But even then the shape is not completely determined. So we could > require to have one row that only consists of arrays and determines > the shape. Alternatively we could have a keyword parameter `shape`: > stack([A, B, 0], [0, C, D], shape=(8, 8)) > > Colin, with `bmat` you can do what you're asking for. Directly taken > from the example: > >>> np.bmat('A,B; C,D') > matrix([[1, 1, 2, 2], > [1, 1, 2, 2], > [3, 4, 7, 8], > [5, 6, 9, 0]]) > > > General question: If `bmat` already offers something like `stack` > should we even bother implementing `stack`? More code leads to more > bugs and maintenance work. > > > Best, > Stefan > > > > On Tue, Sep 9, 2014 at 12:14 AM, cjw wrote: > > > > On 08-Sep-14 4:40 PM, Joseph Martinot-Lagarde wrote: > >> Le 08/09/2014 15:29, Stefan Otte a ?crit : > >>> Hey, > >>> > >>> quite often I work with block matrices. Matlab offers the convenient > notation > >>> > >>> [ a b; c d ] > > This would appear to be a desirable way to go. > > > > Numpy has something similar for strings. The above is neater. > > > > Colin W. > >>> to stack matrices. 
The numpy equivalent is kinda clumsy: > >>> > >>> vstack([hstack([a,b]), hstack([c,d])]) > >>> > >>> I wrote the little function `stack` that does exactly that: > >>> > >>> stack([[a, b], [c, d]]) > >>> > >>> In my case `stack` replaced `hstack` and `vstack` almost completely. > >>> > >>> If you're interested in including it in numpy I created a pull request > >>> [1]. I'm looking forward to getting some feedback! > >>> > >>> > >>> Best, > >>> Stefan > >>> > >>> > >>> > >>> [1] https://github.com/numpy/numpy/pull/5057 > >>> > >> The outside brackets are redundant, stack([[a, b], [c, d]]) should be > >> stack([a, b], [c, d]) > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Tue Sep 9 08:42:27 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 9 Sep 2014 08:42:27 -0400 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <540E1447.4040009@m4x.org> <540E2A40.70208@ncf.ca> Message-ID: On Tue, Sep 9, 2014 at 8:30 AM, wrote: > > > > On Tue, Sep 9, 2014 at 5:42 AM, Stefan Otte wrote: > >> Hey, >> >> @Josef, I wasn't aware of `bmat` and `np.asarray(np.bmat(....))` does >> basically what I want and what I'm already using. >> > > I never needed any tetris or anything similar except for the matched block > version. > > Just to point out two more related functions > > scipy.sparse also has `bmat` for sparse block matrices > scipy.linalg and scipy.sparse have `block_diag` to complement bmat. > > > What I sometimes wish for is a sparse pseudo kronecker product as > convenience to bmat, where the first (or the second) matrix contains (0,1) > flags where the 1's specify where to put the blocks. > (I'm not sure what I really mean, similar to block_diag but with a > different filling pattern.) > (or in analogy to kronecker, np.nonzero provides the filling pattern, but the submatrix is multiplied by the value.) (application: unbalanced repeated measures or panel data) Josef (....) > > Josef > > > >> >> Regarding the Tetris problem: that never happened to me, but stack, as >> Josef pointed out, can handle that already :) >> >> I like the idea of removing the redundant square brackets: >> stack([[a, b], [c, d]]) --> stack([a, b], [c, d]) >> However, if the brackets are there there is no difference between >> creating a `np.array` and stacking arrays with `np.stack`. >> >> If we want to get fancy and turn this PR into something bigger >> (working our way up to a NP-complete problem ;)) then how about this. >> I sometimes have arrays that look like: >> AB0 >> 0 C >> Where 0 is a scalar but is supposed to fill the rest of the array. >> Having something like 0 in there might lead to ambiguities though. What >> does >> ABC >> 0D0 >> mean? One could limit the "filler" to appear only on the left or the >> right: >> AB0 >> 0CD >> But even then the shape is not completely determined. 
So we could >> require to have one row that only consists of arrays and determines >> the shape. Alternatively we could have a keyword parameter `shape`: >> stack([A, B, 0], [0, C, D], shape=(8, 8)) >> >> Colin, with `bmat` you can do what you're asking for. Directly taken >> from the example: >> >>> np.bmat('A,B; C,D') >> matrix([[1, 1, 2, 2], >> [1, 1, 2, 2], >> [3, 4, 7, 8], >> [5, 6, 9, 0]]) >> >> >> General question: If `bmat` already offers something like `stack` >> should we even bother implementing `stack`? More code leads to more >> bugs and maintenance work. >> >> >> Best, >> Stefan >> >> >> >> On Tue, Sep 9, 2014 at 12:14 AM, cjw wrote: >> > >> > On 08-Sep-14 4:40 PM, Joseph Martinot-Lagarde wrote: >> >> Le 08/09/2014 15:29, Stefan Otte a ?crit : >> >>> Hey, >> >>> >> >>> quite often I work with block matrices. Matlab offers the convenient >> notation >> >>> >> >>> [ a b; c d ] >> > This would appear to be a desirable way to go. >> > >> > Numpy has something similar for strings. The above is neater. >> > >> > Colin W. >> >>> to stack matrices. The numpy equivalent is kinda clumsy: >> >>> >> >>> vstack([hstack([a,b]), hstack([c,d])]) >> >>> >> >>> I wrote the little function `stack` that does exactly that: >> >>> >> >>> stack([[a, b], [c, d]]) >> >>> >> >>> In my case `stack` replaced `hstack` and `vstack` almost completely. >> >>> >> >>> If you're interested in including it in numpy I created a pull request >> >>> [1]. I'm looking forward to getting some feedback! >> >>> >> >>> >> >>> Best, >> >>> Stefan >> >>> >> >>> >> >>> >> >>> [1] https://github.com/numpy/numpy/pull/5057 >> >>> >> >> The outside brackets are redundant, stack([[a, b], [c, d]]) should be >> >> stack([a, b], [c, d]) >> >> >> >> _______________________________________________ >> >> NumPy-Discussion mailing list >> >> NumPy-Discussion at scipy.org >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas_unterthiner at web.de Tue Sep 9 11:23:38 2014 From: thomas_unterthiner at web.de (Thomas Unterthiner) Date: Tue, 09 Sep 2014 17:23:38 +0200 Subject: [Numpy-discussion] numpy ignores OPT/FOPT under Python3 Message-ID: <540F1B7A.8030108@web.de> Hi! I want to use the OPT/FOPT environment viariables to set compiler flags when compiling numpy. However it seems that they get ignored under python3. Using Ubuntu 14.04 and numpy 1.9.0, I did the following: >export OPT="-march=native" >export FOPT = "-march=native" > python setup.py build # "python" executes python2.7 [...snip...] C compiler: x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -march=native -fPIC ^C obviously under python 2.7, the additional march flag works as expected. However, using python3: >export OPT="-march=native" >export FOPT = "-march=native" >python3 setup.py build [.... snip ...] C compiler: x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC ^C obviously, the flags aren't used in python 3. Did I overlook something here? 
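(For reference, this is how I checked the base OPT flags distutils itself reports -- just an inspection snippet, it does not explain the mechanism:)

    >>> from distutils import sysconfig
    >>> print(sysconfig.get_config_var('OPT'))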
Do $OPT/$FOPT only work in Python 2.7 by design, is this a bug or did I miss something? Cheers Thomas From njs at pobox.com Tue Sep 9 14:08:52 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 9 Sep 2014 14:08:52 -0400 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <540DE1BF.4030405@gmail.com> Message-ID: On 8 Sep 2014 14:43, "Robert Kern" wrote: > > On Mon, Sep 8, 2014 at 6:05 PM, Pierre-Andre Noel > wrote: > > > I think we could add new generators to NumPy though, > > > perhaps with a keyword to control the algorithm (defaulting to the > > current > > > Mersenne Twister). > > > > Why not do something like the C++11 ? In , a "generator" > > is the engine producing randomness, and a "distribution" decides what is > > the type of outputs that you want. Here is the example on > > http://www.cplusplus.com/reference/random/ . > > > > std::default_random_engine generator; > > std::uniform_int_distribution distribution(1,6); > > int dice_roll = distribution(generator); // generates number in > > the range 1..6 > > > > For convenience, you can bind the generator with the distribution (still > > from the web page above). > > > > auto dice = std::bind(distribution, generator); > > int wisdom = dice()+dice()+dice(); > > > > Here is how I propose to adapt this scheme to numpy. First, there would > > be a global generator defaulting to the current implementation of > > Mersene Twister. Calls to numpy's "RandomState", "seed", "get_state" and > > "set_state" would affect this global generator. > > > > All numpy functions associated to random number generation (i.e., > > everything listed on > > http://docs.scipy.org/doc/numpy/reference/routines.random.html except > > for "RandomState", "seed", "get_state" and "set_state") would accept the > > kwarg "generator", which defaults to the global generator (by default > > the current Mersene Twister). > > > > Now there could be other generator objects: the new Mersene Twister, > > some lightweight-but-worse generator, or some cryptographically-safe > > random generator. Each such generator would have "RandomState", "seed", > > "get_state" and "set_state" methods (except perhaps the > > criptographically-safe ones). When calling a numpy function with > > generator=my_generator, that function uses this generator instead the > > global one. Moreover, there would be be a function, say > > select_default_random_engine(generator), which changes the global > > generator to a user-specified one. > > I think the Python standard library's example is more instructive. We > have new classes for each new core uniform generator. They will share > a common superclass to share the implementation of the non-uniform > distributions. numpy.random.RandomState will continue to be the > current Mersenne Twister implementation, and so will the underlying > global RandomState for all of the convenience functions in > numpy.random. If you want the SFMT variant, you instantiate > numpy.random.SFMT() and call its methods directly. There's also another reason why generator decisions should be part of the RandomState object itself: we may want to change the distribution methods themselves over time (e.g., people have been complaining for a while that we use a suboptimal method for generating gaussian deviates), but changing these creates similar backcompat problems. 
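To make that concrete (the keyword below is hypothetical -- this is the shape of the problem, not an API proposal):

    import numpy as np

    rs = np.random.RandomState(0)
    old = rs.standard_normal(10)    # today's gaussian algorithm

    # If the default algorithm ever changed (say, to a ziggurat method),
    # reproducing 'old' would need an explicit opt-in, something like:
    #
    #   rs.standard_normal(10, method="legacy")   # hypothetical kwarg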
So we need a way to say "please give me samples using the old gaussian implementation" or whatever. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Tue Sep 9 14:30:28 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Tue, 09 Sep 2014 20:30:28 +0200 Subject: [Numpy-discussion] SFMT (faster mersenne twister) In-Reply-To: References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <540DE1BF.4030405@gmail.com> Message-ID: On 09/09/14 20:08, Nathaniel Smith wrote: > There's also another reason why generator decisions should be part of > the RandomState object itself: we may want to change the distribution > methods themselves over time (e.g., people have been complaining for a > while that we use a suboptimal method for generating gaussian deviates), > but changing these creates similar backcompat problems. Which one should we rather use? Ziggurat? Sturla From charlesr.harris at gmail.com Tue Sep 9 15:52:19 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 9 Sep 2014 13:52:19 -0600 Subject: [Numpy-discussion] @ operator Message-ID: Hi All, I'm in the midst of implementing the '@' operator (PEP 465), and there are some behaviors that are unspecified by the PEP. 1. Should the operator accept array_like for one of the arguments? 2. Does it need to handle __numpy_ufunc__, or will __array_priority__ serve? 3. Do we want PyArray_Matmul in the numpy API? 4. Should a matmul function be supplied by the multiarray module? If 3 and 4 are wanted, should they use the __numpy_ufunc__ machinery, or will __array_priority__ serve? Note that the type number operators, __add__ and such, currently use __numpy_ufunc__ in combination with __array_priority__, this in addition to the fact that they are by default using ufuncs that do the same. I'd rather that the __*__ operators simply rely on __array_priority__. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Sep 9 16:03:23 2014 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 9 Sep 2014 21:03:23 +0100 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Tue, Sep 9, 2014 at 8:52 PM, Charles R Harris wrote: > Hi All, > > I'm in the midst of implementing the '@' operator (PEP 465), and there are > some behaviors that are unspecified by the PEP. > > Should the operator accept array_like for one of the arguments? I would be mildly disappointed if it didn't. > Does it need to handle __numpy_ufunc__, or will __array_priority__ serve? Not sure (TBH, I don't remember what __numpy_ufunc__ does off-hand and don't feel bothered enough to look it up). > Do we want PyArray_Matmul in the numpy API? Probably. > Should a matmul function be supplied by the multiarray module? Yes, please. It's rules are a little different than dot()'s, so we should have a function that does it. -- Robert Kern From davidmenhur at gmail.com Wed Sep 10 03:12:39 2014 From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=) Date: Wed, 10 Sep 2014 09:12:39 +0200 Subject: [Numpy-discussion] numpy ignores OPT/FOPT under Python3 In-Reply-To: <540F1B7A.8030108@web.de> References: <540F1B7A.8030108@web.de> Message-ID: On 9 September 2014 17:23, Thomas Unterthiner wrote: > > I want to use the OPT/FOPT environment viariables to set compiler flags > when compiling numpy. However it seems that they get ignored under > python3. 
Using Ubuntu 14.04 and numpy 1.9.0, I did the following: > > >export OPT="-march=native" > >export FOPT = "-march=native" > > python setup.py build # "python" executes python2.7 > [...snip...] > C compiler: x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing > -march=native -fPIC > ^C Running the same on my computer (Fedora 20, python 2.7) doesn't seem to process the flags: C compiler: -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC Double-checking: $ echo $OPT -march=native -------------- next part -------------- An HTML attachment was scrubbed... URL: From bryanv at continuum.io Wed Sep 10 09:05:08 2014 From: bryanv at continuum.io (Bryan Van de Ven) Date: Wed, 10 Sep 2014 08:05:08 -0500 Subject: [Numpy-discussion] ANN: Bokeh 0.6 release Message-ID: <88D1F25A-B655-40CF-B0A9-B7CF3B225DED@continuum.io> On behalf of the Bokeh team, I am very happy to announce the release of Bokeh version 0.6! Bokeh is a Python library for visualizing large and realtime datasets on the web. Its goal is to provide to developers (and domain experts) with capabilities to easily create novel and powerful visualizations that extract insight from local or remote (possibly large) data sets, and to easily publish those visualization to the web for others to explore and interact with. This release includes many bug fixes and improvements over our most recent 0.5.2 release: * Abstract Rendering recipes for large data sets: isocontour, heatmap * New charts in bokeh.charts: Time Series and Categorical Heatmap * Full Python 3 support for bokeh-server * Much expanded User and Dev Guides * Multiple axes and ranges capability * Plot object graph query interface * Hit-testing (hover tool support) for patch glyphs See the CHANGELOG for full details. I'd also like to announce a new Github Organization for Bokeh: https://github.com/bokeh. Currently it is home to Scala and and Julia language bindings for Bokeh, but the Bokeh project itself will be moved there before the next 0.7 release. Any implementors of new language bindings who are interested in hosting your project under this organization are encouraged to contact us. In upcoming releases, you should expect to see more new layout capabilities (colorbar axes, better grid plots and improved annotations), additional tools, even more widgets and more charts, R language bindings, Blaze integration and cloud hosting for Bokeh apps. 
Don't forget to check out the full documentation, interactive gallery, and tutorial at http://bokeh.pydata.org as well as the Bokeh IPython notebook nbviewer index (including all the tutorials) at: http://nbviewer.ipython.org/github/ContinuumIO/bokeh-notebooks/blob/master/index.ipynb If you are using Anaconda, you can install with conda: conda install bokeh Alternatively, you can install with pip: pip install bokeh BokehJS is also available by CDN for use in standalone javascript applications: http://cdn.pydata.org/bokeh-0.6.min.js http://cdn.pydata.org/bokeh-0.6.min.css Issues, enhancement requests, and pull requests can be made on the Bokeh Github page: https://github.com/continuumio/bokeh Questions can be directed to the Bokeh mailing list: bokeh at continuum.io If you have interest in helping to develop Bokeh, please get involved! Thanks, Bryan Van de Ven Continuum Analytics http://continuum.io From sebastian at sipsolutions.net Wed Sep 10 11:11:49 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 10 Sep 2014 17:11:49 +0200 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: <1410361909.9047.17.camel@sebastian-t440> On Di, 2014-09-09 at 13:52 -0600, Charles R Harris wrote: > Hi All, > > > I'm in the midst of implementing the '@' operator (PEP 465), and there > are some behaviors that are unspecified by the PEP. > > 1. Should the operator accept array_like for one of the > arguments? To be in line with all the other operators, I would say yes > 1. Does it need to handle __numpy_ufunc__, or will > __array_priority__ serve? > 2. Do we want PyArray_Matmul in the numpy API? Don't care much either way, but I would say yes (why not). > 1. Should a matmul function be supplied by the multiarray module? > We could possibly put it to the linalg module. But I think we should provide such a function. > If 3 and 4 are wanted, should they use the __numpy_ufunc__ machinery, > or will __array_priority__ serve? > > Note that the type number operators, __add__ and such, currently use > __numpy_ufunc__ in combination with __array_priority__, this in > addition to the fact that they are by default using ufuncs that do the > same. I'd rather that the __*__ operators simply rely on > __array_priority__. > Hmmm, that is a difficult one. For the operators I agree, numpy_ufunc should not be necessary, since we have NotImplemented there, and the array priority does nothing except tell numpy to stop handling all array likes (i.e. other array-likes could actually say: you have an array-priority > 0 -> return NotImplemented). So yeah, why do the operators use numpy_ufunc at all? (unless due to implementation) If we have a function numpy_ufunc would probably make sense, since that circumvents the python dispatch mechanism. - Sebastian > > Thoughts? > > Chuck > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From sturla.molden at gmail.com Wed Sep 10 11:59:28 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Wed, 10 Sep 2014 15:59:28 +0000 (UTC) Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <540DE1BF.4030405@gmail.com> Message-ID: <917025199431961103.174047sturla.molden-gmail.com@news.gmane.org> Pierre-Andre Noel wrote: > Why not do something like the C++11 ? In , a "generator" > is the engine producing randomness, and a "distribution" decides what is > the type of outputs that you want. This is what randomkit is doing internally, which is why it is so easy to plug in a different generator. Sturla From sturla.molden at gmail.com Wed Sep 10 11:59:27 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Wed, 10 Sep 2014 15:59:27 +0000 (UTC) Subject: [Numpy-discussion] SFMT (faster mersenne twister) References: <21515.18012.951610.781484@hebb.inf.ed.ac.uk> <834530714431811490.661154sturla.molden-gmail.com@news.gmane.org> <540DE1BF.4030405@gmail.com> <540DEEBB.5090803@googlemail.com> Message-ID: <261359863431961370.090358sturla.molden-gmail.com@news.gmane.org> Julian Taylor wrote: > But as already mentioned by Robert, we know what we can do, what is > missing is someone writting the code. This is actually a part of NumPy I know in detail, so I will be able to contribute. Robert Kern's last post about objects like np.random.SFMT() working similar to RandomState should be doable and not break any backwards compatibility. Sturla From pav at iki.fi Wed Sep 10 12:52:16 2014 From: pav at iki.fi (Pauli Virtanen) Date: Wed, 10 Sep 2014 19:52:16 +0300 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: 09.09.2014, 22:52, Charles R Harris kirjoitti: > 1. Should the operator accept array_like for one of the arguments? > 2. Does it need to handle __numpy_ufunc__, or will > __array_priority__ serve? I think the __matmul__ operator implementation should follow that of __mul__. [clip] > 3. Do we want PyArray_Matmul in the numpy API? > 4. Should a matmul function be supplied by the multiarray module? > > If 3 and 4 are wanted, should they use the __numpy_ufunc__ machinery, or > will __array_priority__ serve? dot() function deals with __numpy_ufunc__, and the matmul() function should behave similarly. It seems dot() uses __array_priority__ for selection of output return subclass, so matmul() probably needs do the same thing. > Note that the type number operators, __add__ and such, currently use > __numpy_ufunc__ in combination with __array_priority__, this in addition to > the fact that they are by default using ufuncs that do the same. I'd rather > that the __*__ operators simply rely on __array_priority__. The whole business of __array_priority__ and __numpy_ufunc__ in the binary ops is solely about when ____ should yield the execution to __r__ of the other object. The rule of operation currently is: "__rmul__ before __numpy_ufunc__" If you remove the __numpy_ufunc__ handling, it becomes: "__rmul__ before __numpy_ufunc__, except if array_priority happens to be smaller than that of the other class and your class is not an ndarray subclass". 
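As a toy illustration of that yielding rule (the class is made up; __array_priority__ itself is the real attribute):

    import numpy as np

    class Operand(object):
        # A priority above ndarray's asks ndarray's binop to return
        # NotImplemented, so Python falls back to our __rmul__.
        __array_priority__ = 1000.0
        def __rmul__(self, other):
            return "Operand.__rmul__ ran"

    print(np.arange(3) * Operand())   # -> Operand.__rmul__ ran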
The following binops also do not IIRC respect __array_priority__ in preferring right-hand operand: - in-place operations - comparisons One question here is whether it's possible to change the behavior of __array_priority__ here at all, or whether changes are possible only in the context of adding new attributes telling Numpy what to do. -- Pauli Virtanen From charlesr.harris at gmail.com Wed Sep 10 16:53:22 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 10 Sep 2014 14:53:22 -0600 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Wed, Sep 10, 2014 at 10:52 AM, Pauli Virtanen wrote: > 09.09.2014, 22:52, Charles R Harris kirjoitti: > > 1. Should the operator accept array_like for one of the arguments? > > 2. Does it need to handle __numpy_ufunc__, or will > > __array_priority__ serve? > > I think the __matmul__ operator implementation should follow that of > __mul__. > > [clip] > > 3. Do we want PyArray_Matmul in the numpy API? > > 4. Should a matmul function be supplied by the multiarray module? > > > > If 3 and 4 are wanted, should they use the __numpy_ufunc__ machinery, or > > will __array_priority__ serve? > > dot() function deals with __numpy_ufunc__, and the matmul() function > should behave similarly. > > It seems dot() uses __array_priority__ for selection of output return > subclass, so matmul() probably needs do the same thing. > > > Note that the type number operators, __add__ and such, currently use > > __numpy_ufunc__ in combination with __array_priority__, this in addition > to > > the fact that they are by default using ufuncs that do the same. I'd > rather > > that the __*__ operators simply rely on __array_priority__. > > The whole business of __array_priority__ and __numpy_ufunc__ in the > binary ops is solely about when ____ should yield the execution to > __r__ of the other object. > > The rule of operation currently is: "__rmul__ before __numpy_ufunc__" > > If you remove the __numpy_ufunc__ handling, it becomes: "__rmul__ before > __numpy_ufunc__, except if array_priority happens to be smaller than > that of the other class and your class is not an ndarray subclass". > > The following binops also do not IIRC respect __array_priority__ in > preferring right-hand operand: > > - in-place operations > - comparisons > > One question here is whether it's possible to change the behavior of > __array_priority__ here at all, or whether changes are possible only in > the context of adding new attributes telling Numpy what to do. > > I was tempted to make it a generalized ufunc, which would take care of a lot of things, but there is a lot of overhead in those functions. Sounds like the easiest thing is to make it similar to dot, although having an inplace versions complicates the type selection a bit. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Sep 10 16:55:20 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 10 Sep 2014 14:55:20 -0600 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Wed, Sep 10, 2014 at 2:53 PM, Charles R Harris wrote: > > > On Wed, Sep 10, 2014 at 10:52 AM, Pauli Virtanen wrote: > >> 09.09.2014, 22:52, Charles R Harris kirjoitti: >> > 1. Should the operator accept array_like for one of the arguments? >> > 2. Does it need to handle __numpy_ufunc__, or will >> > __array_priority__ serve? >> >> I think the __matmul__ operator implementation should follow that of >> __mul__. 
>> >> [clip] >> > 3. Do we want PyArray_Matmul in the numpy API? >> > 4. Should a matmul function be supplied by the multiarray module? >> > >> > If 3 and 4 are wanted, should they use the __numpy_ufunc__ machinery, or >> > will __array_priority__ serve? >> >> dot() function deals with __numpy_ufunc__, and the matmul() function >> should behave similarly. >> >> It seems dot() uses __array_priority__ for selection of output return >> subclass, so matmul() probably needs do the same thing. >> >> > Note that the type number operators, __add__ and such, currently use >> > __numpy_ufunc__ in combination with __array_priority__, this in >> addition to >> > the fact that they are by default using ufuncs that do the same. I'd >> rather >> > that the __*__ operators simply rely on __array_priority__. >> >> The whole business of __array_priority__ and __numpy_ufunc__ in the >> binary ops is solely about when ____ should yield the execution to >> __r__ of the other object. >> >> The rule of operation currently is: "__rmul__ before __numpy_ufunc__" >> >> If you remove the __numpy_ufunc__ handling, it becomes: "__rmul__ before >> __numpy_ufunc__, except if array_priority happens to be smaller than >> that of the other class and your class is not an ndarray subclass". >> >> The following binops also do not IIRC respect __array_priority__ in >> preferring right-hand operand: >> >> - in-place operations >> - comparisons >> >> One question here is whether it's possible to change the behavior of >> __array_priority__ here at all, or whether changes are possible only in >> the context of adding new attributes telling Numpy what to do. >> >> > I was tempted to make it a generalized ufunc, which would take care of a > lot of things, but there is a lot of overhead in those functions. Sounds > like the easiest thing is to make it similar to dot, although having an > inplace versions complicates the type selection a bit. > Note also that the dot cblas versions are not generally blocked, so the size of the arrays is limited (and not checked). Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Wed Sep 10 22:18:24 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Thu, 11 Sep 2014 02:18:24 +0000 (UTC) Subject: [Numpy-discussion] @ operator References: Message-ID: <1927663638432094396.879571sturla.molden-gmail.com@news.gmane.org> Charles R Harris wrote: > Note also that the dot cblas versions are not generally blocked, so the > size of the arrays is limited (and not checked). But it is possible to create a blocked dot function with the current cblas, even though they use C int for array dimensions. It would just further increase the complexity of dot (as if it's not bad enough already...) Sturla From njs at pobox.com Thu Sep 11 00:08:49 2014 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 11 Sep 2014 00:08:49 -0400 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Wed, Sep 10, 2014 at 4:53 PM, Charles R Harris wrote: > > On Wed, Sep 10, 2014 at 10:52 AM, Pauli Virtanen wrote: >> >> 09.09.2014, 22:52, Charles R Harris kirjoitti: >> > 1. Should the operator accept array_like for one of the arguments? >> > 2. Does it need to handle __numpy_ufunc__, or will >> > __array_priority__ serve? >> >> I think the __matmul__ operator implementation should follow that of >> __mul__. >> >> [clip] >> > 3. Do we want PyArray_Matmul in the numpy API? >> > 4. Should a matmul function be supplied by the multiarray module? 
>> > >> > If 3 and 4 are wanted, should they use the __numpy_ufunc__ machinery, or >> > will __array_priority__ serve? >> >> dot() function deals with __numpy_ufunc__, and the matmul() function >> should behave similarly. >> >> It seems dot() uses __array_priority__ for selection of output return >> subclass, so matmul() probably needs do the same thing. >> >> > Note that the type number operators, __add__ and such, currently use >> > __numpy_ufunc__ in combination with __array_priority__, this in addition >> > to >> > the fact that they are by default using ufuncs that do the same. I'd >> > rather >> > that the __*__ operators simply rely on __array_priority__. >> >> The whole business of __array_priority__ and __numpy_ufunc__ in the >> binary ops is solely about when ____ should yield the execution to >> __r__ of the other object. >> >> The rule of operation currently is: "__rmul__ before __numpy_ufunc__" >> >> If you remove the __numpy_ufunc__ handling, it becomes: "__rmul__ before >> __numpy_ufunc__, except if array_priority happens to be smaller than >> that of the other class and your class is not an ndarray subclass". >> >> The following binops also do not IIRC respect __array_priority__ in >> preferring right-hand operand: >> >> - in-place operations >> - comparisons >> >> One question here is whether it's possible to change the behavior of >> __array_priority__ here at all, or whether changes are possible only in >> the context of adding new attributes telling Numpy what to do. > > I was tempted to make it a generalized ufunc, which would take care of a lot > of things, but there is a lot of overhead in those functions. Sounds like > the easiest thing is to make it similar to dot, although having an inplace > versions complicates the type selection a bit. Can we please fix the overhead instead of adding more half-complete implementations of the same concepts? I feel like this usually ends up slowing things down in the end, as optimization efforts get divided... My vote is: __matmul__/__rmatmul__ do the standard dispatch stuff that all __op__ methods do (so I guess check __array_priority__ or whatever it is we always do). I'd also be okay with ignoring __array_priority__ on the grounds that __numpy_ufunc__ is better, and there's no existing code relying on __array_priority__ support in __matmul__. Having decided that we are actually going to run, they dispatch unconditionally to np.newdot(a, b) (or newdot(a, b, out=a) for the in-place version), similarly to how e.g. __add__ dispatches to np.add. newdot acts like a standard gufunc with all the standard niceties, including __numpy_ufunc__ dispatch. ("newdot" here is intended as a placeholder name, maybe it should be np.linalg.matmul or something else to be bikeshed later. I also vote that eventually 'dot' become an alias for this function, but whether to do that is an orthogonal discussion for later.) -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From ndbecker2 at gmail.com Thu Sep 11 10:27:51 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 11 Sep 2014 10:27:51 -0400 Subject: [Numpy-discussion] why does u.resize return None? 
Message-ID: It would be useful if u.resize returned the new array, so it could be used for chaining operations -- -- Those who don't understand recursion are doomed to repeat it From hoogendoorn.eelco at gmail.com Thu Sep 11 10:45:53 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Thu, 11 Sep 2014 16:45:53 +0200 Subject: [Numpy-discussion] why does u.resize return None? In-Reply-To: References: Message-ID: agreed; I never saw the logic in returning none either. On Thu, Sep 11, 2014 at 4:27 PM, Neal Becker wrote: > It would be useful if u.resize returned the new array, so it could be used > for > chaining operations > > -- > -- Those who don't understand recursion are doomed to repeat it > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Sep 11 10:59:28 2014 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 11 Sep 2014 15:59:28 +0100 Subject: [Numpy-discussion] why does u.resize return None? In-Reply-To: References: Message-ID: On Thu, Sep 11, 2014 at 3:27 PM, Neal Becker wrote: > It would be useful if u.resize returned the new array, so it could be used for > chaining operations Same reason why list.sort() and friends return None. https://docs.python.org/2/faq/design.html#why-doesn-t-list-sort-return-the-sorted-list -- Robert Kern From ndbecker2 at gmail.com Thu Sep 11 10:56:02 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 11 Sep 2014 10:56:02 -0400 Subject: [Numpy-discussion] why does u.resize return None? References: Message-ID: https://github.com/numpy/numpy/issues/5064 Eelco Hoogendoorn wrote: > agreed; I never saw the logic in returning none either. > > On Thu, Sep 11, 2014 at 4:27 PM, Neal Becker wrote: > >> It would be useful if u.resize returned the new array, so it could be used >> for >> chaining operations >> >> -- >> -- Those who don't understand recursion are doomed to repeat it >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> -- -- Those who don't understand recursion are doomed to repeat it From jhaiduce at gmail.com Thu Sep 11 11:46:35 2014 From: jhaiduce at gmail.com (John Haiducek) Date: Thu, 11 Sep 2014 11:46:35 -0400 Subject: [Numpy-discussion] How to get docs for functions processed by numpy.vectorize? In-Reply-To: References: Message-ID: <5411C3DB.8020501@gmail.com> When I apply numpy.vectorize() to a function, documentation tools behave inconsistently with regard to the new, vectorized function. The function's __doc__ attribute does contain the docstring of the original function as expected, but the built-in help() command displays the documentation of the numpy.vectorize class, and sphinx-autodoc fails to display the function at all. Is there a way to get sphinx-autodoc and the built-in help() command to display the docstring of the function and not something else? For instance: >>> import numpy as np >>> def myfunc(x): ... "Square x" ... return x**2 ... 
>>> myfunc=np.vectorize(myfunc)
>>> print myfunc.__doc__
Square x
>>> help(myfunc)
(displays documentation of np.vectorize)

From charlesr.harris at gmail.com  Thu Sep 11 12:10:50 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Thu, 11 Sep 2014 10:10:50 -0600
Subject: [Numpy-discussion] @ operator
In-Reply-To: 
References: 
Message-ID: 

On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith wrote:

> On Wed, Sep 10, 2014 at 4:53 PM, Charles R Harris
> wrote:
>
> [clip]
>
> Can we please fix the overhead instead of adding more half-complete
> implementations of the same concepts? I feel like this usually ends up
> slowing things down in the end, as optimization efforts get divided...
>
> My vote is:
>
> [clip]
>
> ("newdot" here is intended as a placeholder name, maybe it should be
> np.linalg.matmul or something else to be bikeshed later. I also vote
> that eventually 'dot' become an alias for this function, but whether
> to do that is an orthogonal discussion for later.)
>

If we went the ufunc route, I think we would want three of them, matxvec,
vecxmat, and matxmat, because the best inner loops would be different in
the three cases. They couldn't be straight ufuncs themselves, as we don't
need the other options, `reduce`, etc., but they can't be exactly like the
linalg machinery either, because we do want subclasses to be able to
override. Hmm...

The ufunc machinery has some funky aspects. For instance, there are
hardwired checks for `__radd__` and other such operators in
PyUFunc_GenericFunction that allow subclasses to override the ufunc. Those
options should really be part of the PyUFuncObject.

Chuck

From charlesr.harris at gmail.com  Thu Sep 11 21:47:15 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Thu, 11 Sep 2014 19:47:15 -0600
Subject: [Numpy-discussion] @ operator
In-Reply-To: 
References: 
Message-ID: 

On Thu, Sep 11, 2014 at 10:10 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:

> On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith wrote:
>
>> [clip]
>> >> The following binops also do not IIRC respect __array_priority__ in
>> >> preferring the right-hand operand:
>> >>
>> >> - in-place operations
>> >> - comparisons
>>
>> [clip]
>
> The ufunc machinery has some funky aspects. For instance, there are
> hardwired checks for `__radd__` and other such operators in
> PyUFunc_GenericFunction that allow subclasses to override the ufunc. Those
> options should really be part of the PyUFuncObject.
>

Here are some timing results for the inner product of an integer array
comparing inner, ufunc (inner1d), and einsum implementations. The np.long
type was chosen because it is implemented for inner1d and to avoid the
effect of using cblas.

In [43]: a = ones(100000, dtype=np.long)

In [44]: timeit inner(a,a)
10000 loops, best of 3: 55.5 µs per loop

In [45]: timeit inner1d(a,a)
10000 loops, best of 3: 56.2 µs per loop

In [46]: timeit einsum('...i,...i',a,a)
10000 loops, best of 3: 43.8 µs per loop

In [47]: a = ones(1, dtype=np.long)

In [48]: timeit inner(a,a)
1000000 loops, best of 3: 484 ns per loop

In [49]: timeit inner1d(a,a)
1000000 loops, best of 3: 741 ns per loop

In [50]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 811 ns per loop

For big arrays, einsum has a better inner loop. For small arrays, einsum
and inner1d suffer from call overhead.
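A rough stdlib equivalent of the benchmark above, for anyone who wants to
reproduce the numbers outside IPython (a sketch: it assumes inner1d is
importable from the semi-private numpy.core.umath_tests module, and the
repeat/number counts are arbitrary):

import timeit

setup = """
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.ones(100000, dtype=np.long)
"""

for stmt in ["np.inner(a, a)",
             "inner1d(a, a)",
             "np.einsum('...i,...i', a, a)"]:
    # best of 3 repeats, 10000 calls each, reported in microseconds per call
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=10000))
    print("%-30s %6.1f us per loop" % (stmt, best / 10000 * 1e6))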
Note that the einsum overhead could be improved by special casing, and that
the various loops could be consolidated into best of breed.

The ufunc implementation is easy to do for all cases.

Chuck

From njs at pobox.com  Thu Sep 11 22:01:14 2014
From: njs at pobox.com (Nathaniel Smith)
Date: Thu, 11 Sep 2014 22:01:14 -0400
Subject: [Numpy-discussion] @ operator
In-Reply-To: 
References: 
Message-ID: 

On Thu, Sep 11, 2014 at 12:10 PM, Charles R Harris wrote:
>
> On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith wrote:
>>
>> [clip]
>>
> If we went the ufunc route, I think we would want three of them, matxvec,
> vecxmat, and matxmat, because the best inner loops would be different in
> the three cases,

Couldn't we write a single inner loop like:

void ufunc_loop(blah blah) {
    if (arg1_shape[0] == 1 && arg2_shape[1] == 1) {
        call DOT
    } else if (arg2_shape[0] == 1) {
        call GEMV
    } else if (...) {
        ...
    } else {
        call GEMM
    }
}

?

I can't see any reason that this would be measurably slower than
having multiple ufunc loops: the checks are extremely cheap, will
usually only be done once (because most @ calls will only enter the
inner loop once), and if they are done repeatedly in the same call
then they'll come out the same way every time and thus be predictable
branches.

> but they couldn't be straight ufuncs themselves, as we don't
> need the other options, `reduce`, etc.,

Not sure what you mean -- ATM gufuncs don't support reduce anyway, but
if someone felt like implementing it then it would be cool to get
dot.reduce for free -- it is a useful/meaningful operation. Is there
some reason that supporting more ufunc features is bad?

> but they can't be exactly like the
> linalg machinery, because we do want subclasses to be able to override.

Subclasses can override ufuncs using __numpy_ufunc__.

> Hmm...
>
> The ufunc machinery has some funky aspects. For instance, there are
> hardwired checks for `__radd__` and other such operators in
> PyUFunc_GenericFunction that allow subclasses to override the ufunc.
> Those options should really be part of the PyUFuncObject.

Are you mentioning this b/c it's an annoying thing that talking about
ufuncs reminded you of, or is there a specific impact on __matmul__
that you see?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

From charlesr.harris at gmail.com  Thu Sep 11 22:03:03 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Thu, 11 Sep 2014 20:03:03 -0600
Subject: [Numpy-discussion] @ operator
In-Reply-To: 
References: 
Message-ID: 

On Thu, Sep 11, 2014 at 7:47 PM, Charles R Harris wrote:

> On Thu, Sep 11, 2014 at 10:10 AM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
> [clip]
>
> Here are some timing results for the inner product of an integer array
> comparing inner, ufunc (inner1d), and einsum implementations.
>
> [clip]
>
> For big arrays, einsum has a better inner loop. For small arrays, einsum
> and inner1d suffer from call overhead. Note that the einsum overhead could
> be improved by special casing, and that the various loops could be
> consolidated into best of breed.
>
> The ufunc implementation is easy to do for all cases.
>

Same thing for doubles. The speedup due to cblas can be seen, and the
iterator overhead becomes more visible.
In [51]: a = ones(100000, dtype=np.double)

In [52]: timeit inner(a,a)
10000 loops, best of 3: 32.1 µs per loop

In [53]: timeit inner1d(a,a)
10000 loops, best of 3: 134 µs per loop

In [54]: timeit einsum('...i,...i',a,a)
10000 loops, best of 3: 42 µs per loop

In [55]: a = ones(1, dtype=np.double)

In [56]: timeit inner(a,a)
1000000 loops, best of 3: 336 ns per loop

In [57]: timeit inner1d(a,a)
1000000 loops, best of 3: 742 ns per loop

In [58]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 809 ns per loop

But stacked vectors make a difference:

In [59]: a = ones((2, 1), dtype=np.double)

In [60]: timeit for i in range(2): inner(a[i],a[i])
1000000 loops, best of 3: 1.39 µs per loop

In [61]: timeit inner1d(a,a)
1000000 loops, best of 3: 749 ns per loop

In [62]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 888 ns per loop

Chuck

From charlesr.harris at gmail.com  Thu Sep 11 22:12:43 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Thu, 11 Sep 2014 20:12:43 -0600
Subject: [Numpy-discussion] @ operator
In-Reply-To: 
References: 
Message-ID: 

On Thu, Sep 11, 2014 at 8:01 PM, Nathaniel Smith wrote:

> On Thu, Sep 11, 2014 at 12:10 PM, Charles R Harris
> wrote:
>
> [clip]
>
> Couldn't we write a single inner loop like:
>
> void ufunc_loop(blah blah) {
>     [clip]
> }
> ?

Not for generalized ufuncs, different signatures, or if linearized, more
info on dimensions. What you show is essentially what dot does now for
cblas enabled functions. But note, we need more than the simple '@', we
also need stacks of vectors, and turning vectors into matrices, and then
back into vectors seems unnecessarily complicated.

> I can't see any reason that this would be measurably slower than
> having multiple ufunc loops: the checks are extremely cheap, will
> usually only be done once (because most @ calls will only enter the
> inner loop once), and if they are done repeatedly in the same call
> then they'll come out the same way every time and thus be predictable
> branches.

It's not the loops that are expensive.
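The fixed per-call cost in question is easy to see directly: time the same
call on a length-1 and a length-100000 array and separate the two
components. A rough sketch with the stdlib timeit module (the array sizes
and loop counts are arbitrary):

import timeit

setup = "import numpy as np; a1 = np.ones(1); aN = np.ones(100000)"
# seconds per call at n=1 (essentially pure call overhead) and n=100000
t1 = min(timeit.repeat("np.inner(a1, a1)", setup=setup,
                       repeat=3, number=100000)) / 100000
tN = min(timeit.repeat("np.inner(aN, aN)", setup=setup,
                       repeat=3, number=1000)) / 1000
print("fixed per-call overhead: ~%.0f ns" % (t1 * 1e9))
print("cost per element at n=100000: ~%.2f ns" % ((tN - t1) / 100000 * 1e9))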
> > but they couldn't be straight ufuncs themselves, as we don't
> > need the other options, `reduce`, etc.,
>
> Not sure what you mean -- ATM gufuncs don't support reduce anyway, but
> if someone felt like implementing it then it would be cool to get
> dot.reduce for free -- it is a useful/meaningful operation. Is there
> some reason that supporting more ufunc features is bad?
>
> > but they can't be exactly like the
> > linalg machinery, because we do want subclasses to be able to override.
>
> Subclasses can override ufuncs using __numpy_ufunc__.

Yep.

> > Hmm...
> >
> > The ufunc machinery has some funky aspects. For instance, there are
> > hardwired checks for `__radd__` and other such operators in
> > PyUFunc_GenericFunction that allow subclasses to override the ufunc.
> > Those options should really be part of the PyUFuncObject.
>
> Are you mentioning this b/c it's an annoying thing that talking about
> ufuncs reminded you of, or is there a specific impact on __matmul__
> that you see?

Just because it seems out of place. numpy_ufunc is definitely a better way
to do this.

Chuck

From njs at pobox.com  Thu Sep 11 22:49:21 2014
From: njs at pobox.com (Nathaniel Smith)
Date: Thu, 11 Sep 2014 22:49:21 -0400
Subject: [Numpy-discussion] @ operator
In-Reply-To: 
References: 
Message-ID: 

On Thu, Sep 11, 2014 at 10:12 PM, Charles R Harris wrote:
>
> On Thu, Sep 11, 2014 at 8:01 PM, Nathaniel Smith wrote:
>>
>> [clip]
>
> Not for generalized ufuncs, different signatures, or if linearized, more
> info on dimensions.

This sentence no verb, but I think the point you might be raising is:
we don't actually have the technical capability to define a single
gufunc for @, because the matmat, matvec, vecmat, and vecvec forms
have different gufunc signatures ("mn,nk->mk", "mn,n->m", "n,nk->k",
and "n,n->" respectively, I think)?
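For concreteness, the four cases can already be spelled out with einsum,
which broadcasts over any leading "stack" dimensions; a rough illustration
of the signatures above (not the proposed API):

import numpy as np

A = np.ones((10, 3, 4))                          # stack of 3x4 matrices
B = np.ones((10, 4, 5))                          # stack of 4x5 matrices
v = np.ones((10, 4))                             # stack of 4-vectors
w = np.ones((10, 3))                             # stack of 3-vectors

matmat = np.einsum('...mn,...nk->...mk', A, B)   # "mn,nk->mk", shape (10, 3, 5)
matvec = np.einsum('...mn,...n->...m', A, v)     # "mn,n->m",   shape (10, 3)
vecmat = np.einsum('...n,...nk->...k', w, A)     # "n,nk->k",   shape (10, 4)
vecvec = np.einsum('...n,...n->...', v, v)       # "n,n->",     shape (10,)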
This is true, but Jaime has said he's willing to look at fixing this :-): http://thread.gmane.org/gmane.comp.python.numeric.general/58669/focus=58670 ...and fundamentally, it's very difficult to solve this anywhere else except the ufunc internals. When we first enter a function like newdot, we need to check for special overloads like __numpy_ufunc__ *before* we do any array_like->ndarray coercion. But we have to do array_like->ndarray coercion before we know what the shape/ndim of the our inputs is. And in the wrapper-around-multiple-ufuncs approach, we have to check the shape/ndim before we choose which ufunc to dispatch to. So __numpy_ufunc__ won't get checked until after we've already coerced to ndarray, oops... OTOH the ufunc internals already know how to do this dance, so it's at least straightforward (if not necessarily trivial) to insert the correct logic once in the correct place. > What you show is essentially what dot does now for cblas > enabled functions. But note, we need more than the simple '@', we also need > stacks of vectors, and turning vectors into matrices, and then back into > vectors seems unnecessarily complicated. By the time we get into the data/shape/strides world that ufuncs work with, converting vectors->matrices is literally just adding a single entry to the shape+strides arrays. This feels trivial and cheap to me? Or do you just mean that we do actually want broadcasting matvec, vecmat, vecvec gufuncs? I agree with this too but it seems orthogonal to me -- we can have those in any case, even if newdot doesn't literally call them. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From charlesr.harris at gmail.com Thu Sep 11 23:18:02 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 11 Sep 2014 21:18:02 -0600 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Thu, Sep 11, 2014 at 8:49 PM, Nathaniel Smith wrote: > On Thu, Sep 11, 2014 at 10:12 PM, Charles R Harris > wrote: > > > > On Thu, Sep 11, 2014 at 8:01 PM, Nathaniel Smith wrote: > >> > >> On Thu, Sep 11, 2014 at 12:10 PM, Charles R Harris > >> wrote: > >> > > >> > On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith > wrote: > >> >> > >> >> My vote is: > >> >> > >> >> __matmul__/__rmatmul__ do the standard dispatch stuff that all __op__ > >> >> methods do (so I guess check __array_priority__ or whatever it is we > >> >> always do). I'd also be okay with ignoring __array_priority__ on the > >> >> grounds that __numpy_ufunc__ is better, and there's no existing code > >> >> relying on __array_priority__ support in __matmul__. > >> >> > >> >> Having decided that we are actually going to run, they dispatch > >> >> unconditionally to np.newdot(a, b) (or newdot(a, b, out=a) for the > >> >> in-place version), similarly to how e.g. __add__ dispatches to > np.add. > >> >> > >> >> newdot acts like a standard gufunc with all the standard niceties, > >> >> including __numpy_ufunc__ dispatch. > >> >> > >> >> ("newdot" here is intended as a placeholder name, maybe it should be > >> >> np.linalg.matmul or something else to be bikeshed later. I also vote > >> >> that eventually 'dot' become an alias for this function, but whether > >> >> to do that is an orthogonal discussion for later.) 
> >> >> > >> > If we went the ufunc route, I think we would want three of them, > >> > matxvec, > >> > vecxmat, and matxmat, because the best inner loops would be different > in > >> > the > >> > three cases, > >> > >> Couldn't we write a single inner loop like: > >> > >> void ufunc_loop(blah blah) { > >> if (arg1_shape[0] == 1 && arg2_shape[1] == 1) { > >> call DOT > >> } else if (arg2_shape[0] == 1) { > >> call GEMV > >> } else if (...) { > >> ... > >> } else { > >> call GEMM > >> } > >> } > >> ? > > > > Not for generalized ufuncs, different signatures, or if linearized, more > > info on dimensions. > > This sentence no verb, but I think the point you might be raising is: > we don't actually have the technical capability to define a single > gufunc for @, because the matmat, matvec, vecmat, and vecvec forms > have different gufunc signatures ("mn,nk->mk", "mn,n->m", "n,nk->k", > and "n,n->" respectively, I think)? > > This is true, but Jaime has said he's willing to look at fixing this :-): > > http://thread.gmane.org/gmane.comp.python.numeric.general/58669/focus=58670 > > Don't see the need here, the loops are not complicated. > ...and fundamentally, it's very difficult to solve this anywhere else > except the ufunc internals. When we first enter a function like > newdot, we need to check for special overloads like __numpy_ufunc__ > *before* we do any array_like->ndarray coercion. But we have to do > array_like->ndarray coercion before we know what the shape/ndim of the > our inputs is. True. > And in the wrapper-around-multiple-ufuncs approach, we > have to check the shape/ndim before we choose which ufunc to dispatch > to. True. I don't see a problem there and the four ufuncs would be useful in themselves. I think they should be part of the multiarray module methods if we go that way. So __numpy_ufunc__ won't get checked until after we've already > coerced to ndarray, oops... OTOH the ufunc internals already know how > to do this dance, so it's at least straightforward (if not necessarily > trivial) to insert the correct logic once in the correct place. > I'm thinking that the four ufuncs being overridden by user subtypes should be sufficient. > > > What you show is essentially what dot does now for cblas > > enabled functions. But note, we need more than the simple '@', we also > need > > stacks of vectors, and turning vectors into matrices, and then back into > > vectors seems unnecessarily complicated. > > By the time we get into the data/shape/strides world that ufuncs work > with, converting vectors->matrices is literally just adding a single > entry to the shape+strides arrays. This feels trivial and cheap to me? > And moving things around, then removing. It's an option if we just want to use matrix multiplication for everything. I don't think there is any speed advantage one way or the other, although there are currently size limitations that are easier to block in the 1D case than the matrix case. In practice 2**31 per dimension for floating point is probably plenty. > > Or do you just mean that we do actually want broadcasting matvec, > vecmat, vecvec gufuncs? I agree with this too but it seems orthogonal > to me -- we can have those in any case, even if newdot doesn't > literally call them. > Yes. But if we have them, we might as well use them. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Fri Sep 12 01:09:15 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 12 Sep 2014 01:09:15 -0400 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Thu, Sep 11, 2014 at 11:18 PM, Charles R Harris wrote: > > On Thu, Sep 11, 2014 at 8:49 PM, Nathaniel Smith wrote: >> >> On Thu, Sep 11, 2014 at 10:12 PM, Charles R Harris >> wrote: >> > >> > On Thu, Sep 11, 2014 at 8:01 PM, Nathaniel Smith wrote: >> >> >> >> On Thu, Sep 11, 2014 at 12:10 PM, Charles R Harris >> >> wrote: >> >> > >> >> > On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith >> >> > wrote: >> >> >> >> >> >> My vote is: >> >> >> >> >> >> __matmul__/__rmatmul__ do the standard dispatch stuff that all >> >> >> __op__ >> >> >> methods do (so I guess check __array_priority__ or whatever it is we >> >> >> always do). I'd also be okay with ignoring __array_priority__ on the >> >> >> grounds that __numpy_ufunc__ is better, and there's no existing code >> >> >> relying on __array_priority__ support in __matmul__. >> >> >> >> >> >> Having decided that we are actually going to run, they dispatch >> >> >> unconditionally to np.newdot(a, b) (or newdot(a, b, out=a) for the >> >> >> in-place version), similarly to how e.g. __add__ dispatches to >> >> >> np.add. >> >> >> >> >> >> newdot acts like a standard gufunc with all the standard niceties, >> >> >> including __numpy_ufunc__ dispatch. >> >> >> >> >> >> ("newdot" here is intended as a placeholder name, maybe it should be >> >> >> np.linalg.matmul or something else to be bikeshed later. I also vote >> >> >> that eventually 'dot' become an alias for this function, but whether >> >> >> to do that is an orthogonal discussion for later.) >> >> >> >> >> > If we went the ufunc route, I think we would want three of them, >> >> > matxvec, >> >> > vecxmat, and matxmat, because the best inner loops would be different >> >> > in >> >> > the >> >> > three cases, >> >> >> >> Couldn't we write a single inner loop like: >> >> >> >> void ufunc_loop(blah blah) { >> >> if (arg1_shape[0] == 1 && arg2_shape[1] == 1) { >> >> call DOT >> >> } else if (arg2_shape[0] == 1) { >> >> call GEMV >> >> } else if (...) { >> >> ... >> >> } else { >> >> call GEMM >> >> } >> >> } >> >> ? >> > >> > Not for generalized ufuncs, different signatures, or if linearized, more >> > info on dimensions. >> >> This sentence no verb, but I think the point you might be raising is: >> we don't actually have the technical capability to define a single >> gufunc for @, because the matmat, matvec, vecmat, and vecvec forms >> have different gufunc signatures ("mn,nk->mk", "mn,n->m", "n,nk->k", >> and "n,n->" respectively, I think)? >> >> This is true, but Jaime has said he's willing to look at fixing this :-): >> >> http://thread.gmane.org/gmane.comp.python.numeric.general/58669/focus=58670 >> > > Don't see the need here, the loops are not complicated. > >> >> ...and fundamentally, it's very difficult to solve this anywhere else >> except the ufunc internals. When we first enter a function like >> newdot, we need to check for special overloads like __numpy_ufunc__ >> *before* we do any array_like->ndarray coercion. But we have to do >> array_like->ndarray coercion before we know what the shape/ndim of the >> our inputs is. > > > True. > >> >> And in the wrapper-around-multiple-ufuncs approach, we >> have to check the shape/ndim before we choose which ufunc to dispatch >> to. > > > True. I don't see a problem there and the four ufuncs would be useful in > themselves. 
I think they should be part of the multiarray module methods if > we go that way. Not sure what you mean about multiarray module methods -- it's kinda tricky to expose ufuncs from multiarray.so isn't it, b/c of the split between multiarray.so and umath.so? >> So __numpy_ufunc__ won't get checked until after we've already >> coerced to ndarray, oops... OTOH the ufunc internals already know how >> to do this dance, so it's at least straightforward (if not necessarily >> trivial) to insert the correct logic once in the correct place. > > > I'm thinking that the four ufuncs being overridden by user subtypes should > be sufficient. Maybe I don't understand what you're proposing. Suppose we get handed a random 3rd party type that has a __numpy_ufunc__ attribute, but we know nothing else about it. What do we do? Pick one of the 4 ufuncs at random and call it? >> >> >> > What you show is essentially what dot does now for cblas >> > enabled functions. But note, we need more than the simple '@', we also >> > need >> > stacks of vectors, and turning vectors into matrices, and then back into >> > vectors seems unnecessarily complicated. >> >> By the time we get into the data/shape/strides world that ufuncs work >> with, converting vectors->matrices is literally just adding a single >> entry to the shape+strides arrays. This feels trivial and cheap to me? > > > And moving things around, then removing. It's an option if we just want to > use matrix multiplication for everything. I don't think there is any speed > advantage one way or the other, although there are currently size > limitations that are easier to block in the 1D case than the matrix case. In > practice 2**31 per dimension for floating point is probably plenty. I guess we'll have to implement the 2d blocking sooner or later, and then it won't much matter whether the 1d blocking is simpler, because implementing *anything* for 1d blocking will still be more complicated than just using the 2d blocking code. (Assuming DGEMM is just as fast as DDOT/DGEMV, which seems likely.) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From antony.lee at berkeley.edu Fri Sep 12 01:54:37 2014 From: antony.lee at berkeley.edu (Antony Lee) Date: Thu, 11 Sep 2014 22:54:37 -0700 Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce Message-ID: Hi, I thought that ufunc.reduce performs broadcasting, but it seems a bit confused by boolean arrays: In [1]: add.reduce([array([1, 2]), array([1])]) Out[1]: array([2, 3]) In [2]: logical_and.reduce([array([True, False], dtype=bool), array([True], dtype=bool)]) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () ----> 1 logical_and.reduce([array([True, False], dtype=bool), array([True], dtype=bool)]) ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() Am I missing something here? Thanks, Antony -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From sebastian at sipsolutions.net  Fri Sep 12 03:48:06 2014
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Fri, 12 Sep 2014 09:48:06 +0200
Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce
In-Reply-To: 
References: 
Message-ID: <1410508086.13211.1.camel@sebastian-t440>

On Do, 2014-09-11 at 22:54 -0700, Antony Lee wrote:
> Hi,
> I thought that ufunc.reduce performs broadcasting, but it seems a bit
> confused by boolean arrays:
>
> In [1]: add.reduce([array([1, 2]), array([1])])
> Out[1]: array([2, 3])
> In [2]: logical_and.reduce([array([True, False], dtype=bool),
> array([True], dtype=bool)])
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> ----> 1 logical_and.reduce([array([True, False], dtype=bool),
> array([True], dtype=bool)])
>
> ValueError: The truth value of an array with more than one element is
> ambiguous. Use a.any() or a.all()
>
> Am I missing something here?

`np.asarray([array([1, 2]), array([1])])` is an object array, not a
boolean array. You probably want to concatenate them.

- Sebastian

From antony.lee at berkeley.edu  Fri Sep 12 05:04:53 2014
From: antony.lee at berkeley.edu (Antony Lee)
Date: Fri, 12 Sep 2014 02:04:53 -0700
Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce
In-Reply-To: <1410508086.13211.1.camel@sebastian-t440>
References: <1410508086.13211.1.camel@sebastian-t440>
Message-ID: 

I am not using asarray here. Sorry, but I don't see how this is relevant --
my comparison with np.add.reduce is simply that when a list of float arrays
is passed to np.add.reduce, broadcasting happens as usual, but not when a
list of bool arrays is passed to np.logical_and.reduce.

2014-09-12 0:48 GMT-07:00 Sebastian Berg :

> On Do, 2014-09-11 at 22:54 -0700, Antony Lee wrote:
>
> [clip]
>
> `np.asarray([array([1, 2]), array([1])])` is an object array, not a
> boolean array. You probably want to concatenate them.
>
> - Sebastian

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From robert.kern at gmail.com Fri Sep 12 05:22:06 2014 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 12 Sep 2014 10:22:06 +0100 Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce In-Reply-To: References: <1410508086.13211.1.camel@sebastian-t440> Message-ID: On Fri, Sep 12, 2014 at 10:04 AM, Antony Lee wrote: > I am not using asarray here. Sorry, but I don't see how this is relevant -- > my comparison with np.add.reduce is simply that when a list of float arrays > is passed to np.add.reduce, broadcasting happens as usual, but not when a > list of bool arrays is passed to np.logical_and.reduce. But np.logical_and.reduce() *does* use asarray() when it is given a list object (all ufunc .reduce() methods do this). In both cases, you get a dtype=object array. This means that the ufunc will use the dtype=object inner loop, not the dtype=bool inner loop. For np.add, this isn't a problem. It just calls the __add__() method on the first object which, since it's an ndarray, calls np.add() again to do the actual work, this time using the appropriate dtype inner loop for the inner objects. But np.logical_and is different! For the dtype=object inner loop, it directly calls bool(x) on each item of the object array; it doesn't defer to any other method that might do the computation. bool(almost_any_ndarray) raises the ValueError that you saw. np.logical_and.reduce([x, y]) is not the same as np.logical_and(x, y). You can see how the dtype=object inner loop of np.logical_and() works by directly constructing dtype=object shape-() arrays: [~] |14> x array(None, dtype=object) [~] |15> x[()] = np.array([True, False]) [~] |16> x array(array([ True, False], dtype=bool), dtype=object) [~] |17> y = np.array(None, dtype=object) [~] |18> y[()] = np.array([[True], [False]]) [~] |19> y array(array([[ True], [False]], dtype=bool), dtype=object) [~] |20> np.logical_and(x, y) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () ----> 1 np.logical_and(x, y) ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() -- Robert Kern From charlesr.harris at gmail.com Fri Sep 12 08:07:53 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 12 Sep 2014 06:07:53 -0600 Subject: [Numpy-discussion] @ operator In-Reply-To: References: Message-ID: On Thu, Sep 11, 2014 at 11:09 PM, Nathaniel Smith wrote: > On Thu, Sep 11, 2014 at 11:18 PM, Charles R Harris > wrote: > > > > On Thu, Sep 11, 2014 at 8:49 PM, Nathaniel Smith wrote: > >> > >> On Thu, Sep 11, 2014 at 10:12 PM, Charles R Harris > >> wrote: > >> > > >> > On Thu, Sep 11, 2014 at 8:01 PM, Nathaniel Smith > wrote: > >> >> > >> >> On Thu, Sep 11, 2014 at 12:10 PM, Charles R Harris > >> >> wrote: > >> >> > > >> >> > On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith > >> >> > wrote: > >> >> >> > >> >> >> My vote is: > >> >> >> > >> >> >> __matmul__/__rmatmul__ do the standard dispatch stuff that all > >> >> >> __op__ > >> >> >> methods do (so I guess check __array_priority__ or whatever it is > we > >> >> >> always do). I'd also be okay with ignoring __array_priority__ on > the > >> >> >> grounds that __numpy_ufunc__ is better, and there's no existing > code > >> >> >> relying on __array_priority__ support in __matmul__. 
> >> >> >> > >> >> >> Having decided that we are actually going to run, they dispatch > >> >> >> unconditionally to np.newdot(a, b) (or newdot(a, b, out=a) for the > >> >> >> in-place version), similarly to how e.g. __add__ dispatches to > >> >> >> np.add. > >> >> >> > >> >> >> newdot acts like a standard gufunc with all the standard niceties, > >> >> >> including __numpy_ufunc__ dispatch. > >> >> >> > >> >> >> ("newdot" here is intended as a placeholder name, maybe it should > be > >> >> >> np.linalg.matmul or something else to be bikeshed later. I also > vote > >> >> >> that eventually 'dot' become an alias for this function, but > whether > >> >> >> to do that is an orthogonal discussion for later.) > >> >> >> > >> >> > If we went the ufunc route, I think we would want three of them, > >> >> > matxvec, > >> >> > vecxmat, and matxmat, because the best inner loops would be > different > >> >> > in > >> >> > the > >> >> > three cases, > >> >> > >> >> Couldn't we write a single inner loop like: > >> >> > >> >> void ufunc_loop(blah blah) { > >> >> if (arg1_shape[0] == 1 && arg2_shape[1] == 1) { > >> >> call DOT > >> >> } else if (arg2_shape[0] == 1) { > >> >> call GEMV > >> >> } else if (...) { > >> >> ... > >> >> } else { > >> >> call GEMM > >> >> } > >> >> } > >> >> ? > >> > > >> > Not for generalized ufuncs, different signatures, or if linearized, > more > >> > info on dimensions. > >> > >> This sentence no verb, but I think the point you might be raising is: > >> we don't actually have the technical capability to define a single > >> gufunc for @, because the matmat, matvec, vecmat, and vecvec forms > >> have different gufunc signatures ("mn,nk->mk", "mn,n->m", "n,nk->k", > >> and "n,n->" respectively, I think)? > >> > >> This is true, but Jaime has said he's willing to look at fixing this > :-): > >> > >> > http://thread.gmane.org/gmane.comp.python.numeric.general/58669/focus=58670 > >> > > > > Don't see the need here, the loops are not complicated. > > > >> > >> ...and fundamentally, it's very difficult to solve this anywhere else > >> except the ufunc internals. When we first enter a function like > >> newdot, we need to check for special overloads like __numpy_ufunc__ > >> *before* we do any array_like->ndarray coercion. But we have to do > >> array_like->ndarray coercion before we know what the shape/ndim of the > >> our inputs is. > > > > > > True. > > > >> > >> And in the wrapper-around-multiple-ufuncs approach, we > >> have to check the shape/ndim before we choose which ufunc to dispatch > >> to. > > > > > > True. I don't see a problem there and the four ufuncs would be useful in > > themselves. I think they should be part of the multiarray module methods > if > > we go that way. > > Not sure what you mean about multiarray module methods -- it's kinda > tricky to expose ufuncs from multiarray.so isn't it, b/c of the split > between multiarray.so and umath.so? > Using the umath c-api is no worse than using the other c-api's. Using ufuncs from umath is trickier, IIRC multiarray initializes the ops that use them on load, but I'd define the new functions in multiarray, not umath. Generating the generalized functions can take place either during module load, or by using the static variable pattern on first function call, i.e., static ufunc myfunc = NULL; if (myfunc == NULL) { myfunc = make_ufunc(...); } I tend towards the latter as I want to use the functions in the multiarray c-api. The umath module uses the former. 
The vector dot functions are already available in the descr and can be
passed to a generic loop (overhead), but what I would like to do at some
point is bring all the loops together in their own file in multiarray.

> >> So __numpy_ufunc__ won't get checked until after we've already
> >> coerced to ndarray, oops... OTOH the ufunc internals already know how
> >> to do this dance, so it's at least straightforward (if not necessarily
> >> trivial) to insert the correct logic once in the correct place.
> >
> > I'm thinking that the four ufuncs being overridden by user subtypes
> > should be sufficient.
>
> Maybe I don't understand what you're proposing.

I'd like to have the four ufuncs vecvec, vecmat, matvec, and matmat in
multiarray. Each of those can be overridden by subtypes using the
__numpy_ufunc__ mechanism. Then matmul would, as you say, make the input
arrays and then call the appropriate function depending on dimensions.

> Suppose we get handed a random 3rd party type that has a __numpy_ufunc__
> attribute, but we know nothing else about it. What do we do? Pick one of
> the 4 ufuncs at random and call it?

Every ufunc, when called, uses the numpy_ufunc mechanism to check for
overrides; it is built into the ufunc. So a subclass would treat the four
new functions like it treats any other ufunc. That would be one of the
benefits of using ufuncs. The call chain would be matmul -> new_ufunc ->
subclass; the subclass would not need to bother with matmul itself.

> >> > What you show is essentially what dot does now for cblas
> >> > enabled functions. But note, we need more than the simple '@', we
> >> > also need stacks of vectors, and turning vectors into matrices, and
> >> > then back into vectors seems unnecessarily complicated.
> >>
> >> By the time we get into the data/shape/strides world that ufuncs work
> >> with, converting vectors->matrices is literally just adding a single
> >> entry to the shape+strides arrays. This feels trivial and cheap to me?
> >
> > And moving things around, then removing. It's an option if we just want
> > to use matrix multiplication for everything. I don't think there is any
> > speed advantage one way or the other, although there are currently size
> > limitations that are easier to block in the 1D case than the matrix
> > case. In practice 2**31 per dimension for floating point is probably
> > plenty.
>
> I guess we'll have to implement the 2d blocking sooner or later, and
> then it won't much matter whether the 1d blocking is simpler, because
> implementing *anything* for 1d blocking will still be more complicated
> than just using the 2d blocking code. (Assuming DGEMM is just as fast
> as DDOT/DGEMV, which seems likely.)

Yeah, but that can be pushed off a while, at least until we have a working
implementation.

Chuck

From pierre.haessig at crans.org  Fri Sep 12 08:56:58 2014
From: pierre.haessig at crans.org (Pierre Haessig)
Date: Fri, 12 Sep 2014 14:56:58 +0200
Subject: [Numpy-discussion] How to get docs for functions processed by
 numpy.vectorize?
In-Reply-To: <5411C3DB.8020501@gmail.com>
References: <5411C3DB.8020501@gmail.com>
Message-ID: <5412ED9A.9070101@crans.org>

Hi,

I tried to follow the calling logic behind the help function and I've
arrived at the 3rd-4th level underground with the pydoc.render_doc
function. Here some logic inspects the `thing` that is getting documented,
as shown below.
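A quick way to see what that inspection decides for a vectorized function
(a sketch; the exact printed type name may vary between numpy versions):

import inspect
import numpy as np

f = np.vectorize(lambda x: x**2)
print(type(f))               # the vectorize class, not a plain function
print(inspect.isroutine(f))  # False, so pydoc documents type(f) instead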
The fact that the docstring of the vectorize function is not read may
relate to the fact that the type of a vectorized function is not
`function` but `numpy.lib.function_base.vectorize` instead.

Here is the code that does the magic, for further inspection:

pydoc.render_doc??

def render_doc(thing, title='Python Library Documentation: %s', forceload=0):
    """Render text documentation, given an object or a path to an object."""
    object, name = resolve(thing, forceload)
    desc = describe(object)
    module = inspect.getmodule(object)
    if name and '.' in name:
        desc += ' in ' + name[:name.rfind('.')]
    elif module and module is not object:
        desc += ' in module ' + module.__name__
    if type(object) is _OLD_INSTANCE_TYPE:
        # If the passed object is an instance of an old-style class,
        # document its available methods instead of its value.
        object = object.__class__
    elif not (inspect.ismodule(object) or
              inspect.isclass(object) or
              inspect.isroutine(object) or
              inspect.isgetsetdescriptor(object) or
              inspect.ismemberdescriptor(object) or
              isinstance(object, property)):
        # If the passed object is a piece of data or an instance,
        # document its available methods instead of its value.
        object = type(object)
        desc += ' object'
    return title % desc + '\n\n' + text.document(object, name)

-- 
Pierre

From antony.lee at berkeley.edu  Fri Sep 12 12:44:25 2014
From: antony.lee at berkeley.edu (Antony Lee)
Date: Fri, 12 Sep 2014 09:44:25 -0700
Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce
In-Reply-To: 
References: <1410508086.13211.1.camel@sebastian-t440>
Message-ID: 

I see. I went back to the documentation of ufunc.reduce and this is not
explicitly mentioned, although a posteriori it makes sense; perhaps this
can be made clearer there?

Antony

2014-09-12 2:22 GMT-07:00 Robert Kern :

> On Fri, Sep 12, 2014 at 10:04 AM, Antony Lee wrote:
> > I am not using asarray here. Sorry, but I don't see how this is
> > relevant -- my comparison with np.add.reduce is simply that when a list
> > of float arrays is passed to np.add.reduce, broadcasting happens as
> > usual, but not when a list of bool arrays is passed to
> > np.logical_and.reduce.
>
> But np.logical_and.reduce() *does* use asarray() when it is given a
> list object (all ufunc .reduce() methods do this).
>
> [clip]

From robert.kern at gmail.com  Fri Sep 12 12:46:56 2014
From: robert.kern at gmail.com (Robert Kern)
Date: Fri, 12 Sep 2014 17:46:56 +0100
Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce
In-Reply-To: 
References: <1410508086.13211.1.camel@sebastian-t440>
Message-ID: 

On Fri, Sep 12, 2014 at 5:44 PM, Antony Lee wrote:
> I see. I went back to the documentation of ufunc.reduce and this is not
> explicitly mentioned, although a posteriori it makes sense; perhaps this
> can be made clearer there?

Please recommend the documentation you would like to see.

-- 
Robert Kern

From robert.kern at gmail.com  Fri Sep 12 13:46:15 2014
From: robert.kern at gmail.com (Robert Kern)
Date: Fri, 12 Sep 2014 18:46:15 +0100
Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce
In-Reply-To: 
References: <1410508086.13211.1.camel@sebastian-t440>
Message-ID: 

On Fri, Sep 12, 2014 at 5:46 PM, Robert Kern wrote:
> On Fri, Sep 12, 2014 at 5:44 PM, Antony Lee wrote:
>> I see. I went back to the documentation of ufunc.reduce and this is not
>> explicitly mentioned, although a posteriori it makes sense; perhaps this
>> can be made clearer there?
>
> Please recommend the documentation you would like to see.

Specifically, the behavior I described is the interaction of several
different things, but you don't mention which part of it "is not
explicitly mentioned".

-- 
Robert Kern

From charlesr.harris at gmail.com  Fri Sep 12 17:38:31 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Fri, 12 Sep 2014 15:38:31 -0600
Subject: [Numpy-discussion] Remove numpy/core/src/multiarray/testcalcs.py
 file
Message-ID: 

Hi All,

There is an old file, numpy/core/src/multiarray/testcalcs.py, that looks
like a forgotten leftover from the original 2009 datetime work. Is there
any reason that this file should not be removed?

Chuck

From antony.lee at berkeley.edu  Fri Sep 12 18:56:26 2014
From: antony.lee at berkeley.edu (Antony Lee)
Date: Fri, 12 Sep 2014 15:56:26 -0700
Subject: [Numpy-discussion] Broadcasting with np.logical_and.reduce
In-Reply-To: 
References: <1410508086.13211.1.camel@sebastian-t440>
Message-ID: 

I read the "Methods" section of the ufuncs doc page (
http://docs.scipy.org/doc/numpy/reference/ufuncs.html#methods) again and I
think this could be made clearer simply by replacing the first sentence

from "All ufuncs have four methods."
to "All ufuncs have five methods that operate on array-like objects." (yes, there's also "at", which seems to have been added later to the doc...) This would make it somewhat clearer that "logical_and.reduce([array([True, False], dtype=bool), array([True], dtype=bool)])" interprets the single list argument as an array-like (of dtype object) rather than as an iterable over which to reduce (as python's builtin reduce would). In fact there is another point in that paragraph that could be improved; namely "axis" does not have to be an integer for "reduce". Antony 2014-09-12 10:46 GMT-07:00 Robert Kern : > On Fri, Sep 12, 2014 at 5:46 PM, Robert Kern > wrote: > > On Fri, Sep 12, 2014 at 5:44 PM, Antony Lee > wrote: > >> I see. I went back to the documentation of ufunc.reduce and this is not > >> explicitly mentioned although a posteriori it makes sense; perhaps this > can > >> be made clearer there? > > > > Please recommend the documentation you would like to see. > > Specifically, the behavior I described is the interaction of several > different things, but you don't mention which part of it "is not > explicitly mentioned". > > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rnelsonchem at gmail.com Sun Sep 14 22:22:41 2014 From: rnelsonchem at gmail.com (Ryan Nelson) Date: Sun, 14 Sep 2014 22:22:41 -0400 Subject: [Numpy-discussion] Question about broadcasting vs for loop performance Message-ID: Hello all, I have a question about the performance of broadcasting versus Python for loops. I have the following sample code that approximates some simulation I'd like to do: ## Test Code ## import numpy as np def lorentz(x, pos, inten, hwhm): return inten*( hwhm**2 / ( (x - pos)**2 + hwhm**2 ) ) poss = np.random.rand(100) intens = np.random.rand(100) xs = np.linspace(0,10,10000) def first_try(): sim_inten = np.zeros(xs.shape) for freq, inten in zip(poss, intens): sim_inten += lorentz(xs, freq, inten, 5.0) return sim_inten def second_try(): sim_inten2 = lorentz(xs.reshape((-1,1)), poss, intens, 5.0) sim_inten2 = sim_inten2.sum(axis=1) return sim_inten2 print np.array_equal(first_try(), second_try()) ## End Test ## Running this script prints "True" for the final equality test. However, IPython's %timeit magic, gives ~10 ms for first_try and ~30 ms for second_try. I tried this on Windows 7 (Anaconda Python) and on a Linux machine both with Python 2.7 and Numpy 1.8.2. I understand in principle why broadcasting should be faster than Python loops, but I'm wondering why I'm getting worse results with the pure Numpy function. Is there some general rules for when broadcasting might give worse performance than a Python loop? Thanks Ryan -------------- next part -------------- An HTML attachment was scrubbed... URL: From rnelsonchem at gmail.com Sun Sep 14 22:53:13 2014 From: rnelsonchem at gmail.com (Ryan Nelson) Date: Sun, 14 Sep 2014 22:53:13 -0400 Subject: [Numpy-discussion] Question about broadcasting vs for loop performance In-Reply-To: References: Message-ID: I think I figured out my own question. I guess that the broadcasting approach is generating a very large 2D array in memory, which takes a bit of extra time. 
I gathered this from reading the last example on the following site: http://wiki.scipy.org/EricsBroadcastingDoc I tried this again with a much smaller "xs" array (~100 points) and the broadcasting version was much faster. Thanks Ryan Note: The link to the Scipy wiki page above is broken at the bottom of Numpy's broadcasting page, otherwise I would have seen that earlier. Sorry for the noise. On Sun, Sep 14, 2014 at 10:22 PM, Ryan Nelson wrote: > Hello all, > > I have a question about the performance of broadcasting versus Python for > loops. I have the following sample code that approximates some simulation > I'd like to do: > > ## Test Code ## > > import numpy as np > > > def lorentz(x, pos, inten, hwhm): > > return inten*( hwhm**2 / ( (x - pos)**2 + hwhm**2 ) ) > > > poss = np.random.rand(100) > > intens = np.random.rand(100) > > xs = np.linspace(0,10,10000) > > > def first_try(): > > sim_inten = np.zeros(xs.shape) > > for freq, inten in zip(poss, intens): > > sim_inten += lorentz(xs, freq, inten, 5.0) > > return sim_inten > > > def second_try(): > > sim_inten2 = lorentz(xs.reshape((-1,1)), poss, intens, 5.0) > > sim_inten2 = sim_inten2.sum(axis=1) > > return sim_inten2 > > > print np.array_equal(first_try(), second_try()) > > > ## End Test ## > > > Running this script prints "True" for the final equality test. However, > IPython's %timeit magic, gives ~10 ms for first_try and ~30 ms for > second_try. I tried this on Windows 7 (Anaconda Python) and on a Linux > machine both with Python 2.7 and Numpy 1.8.2. > > > I understand in principle why broadcasting should be faster than Python > loops, but I'm wondering why I'm getting worse results with the pure Numpy > function. Is there some general rules for when broadcasting might give > worse performance than a Python loop? > > > Thanks > > > Ryan > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mads.ipsen at gmail.com Mon Sep 15 04:16:26 2014 From: mads.ipsen at gmail.com (Mads Ipsen) Date: Mon, 15 Sep 2014 10:16:26 +0200 Subject: [Numpy-discussion] Tracking and inspecting numpy objects Message-ID: <5416A05A.2000905@gmail.com> Hi, I am trying to inspect the reference count of numpy arrays generated by my application. Initially, I thought I could inspect the tracked objects using gc.get_objects(), but, with respect to numpy objects, the returned container is empty. For example: import numpy import gc data = numpy.ones(1024).reshape((32,32)) objs = [o for o in gc.get_objects() if isinstance(o, numpy.ndarray)] print objs # Prints empty list print gc.is_tracked(data) # Print False Why is this? Also, is there some other technique I can use to inspect all numpy generated objects? Thanks in advance. Best regards, Mads -- +---------------------------------------------------------+ | Mads Ipsen | +----------------------+----------------------------------+ | G?seb?ksvej 7, 4. 
tv | phone: +45-29716388 | | DK-2500 Valby | email: mads.ipsen at gmail.com | | Denmark | map : www.tinyurl.com/ns52fpa | +----------------------+----------------------------------+ From sebastian at sipsolutions.net Mon Sep 15 05:48:48 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Mon, 15 Sep 2014 11:48:48 +0200 Subject: [Numpy-discussion] Linear algebra functions on empty arrays Message-ID: <1410774528.7696.9.camel@sebastian-t440> Hey all, for https://github.com/numpy/numpy/pull/3861/files I would like to allow 0-sized dimensions for generalized ufuncs, meaning both that a gufunc has to be able to handle this and that it *can* handle it at all. However, lapack does not support this, so it needs some explicit fixing. Also, some of the linalg functions currently explicitly allow empty arrays and others explicitly disallow them. For example, QR and eigvals do not allow them, but on the other hand solve explicitly does (though it most probably never worked, simply because lapack does not support it). So I am wondering if there is some convention for this, or what convention we should implement. Regards, Sebastian -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From hoogendoorn.eelco at gmail.com Mon Sep 15 06:05:17 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Mon, 15 Sep 2014 12:05:17 +0200 Subject: [Numpy-discussion] Tracking and inspecting numpy objects In-Reply-To: <1410774912.7696.13.camel@sebastian-t440> References: <5416A05A.2000905@gmail.com> <1410774912.7696.13.camel@sebastian-t440> Message-ID: On Mon, Sep 15, 2014 at 11:55 AM, Sebastian Berg wrote: > On Mo, 2014-09-15 at 10:16 +0200, Mads Ipsen wrote: > > Hi, > > > > I am trying to inspect the reference count of numpy arrays generated by > > my application. > > > > Initially, I thought I could inspect the tracked objects using > > gc.get_objects(), but, with respect to numpy objects, the returned > > container is empty. For example: > > > > import numpy > > import gc > > > > data = numpy.ones(1024).reshape((32,32)) > > > > objs = [o for o in gc.get_objects() if isinstance(o, numpy.ndarray)] > > > > print objs # Prints empty list > > print gc.is_tracked(data) # Print False > > > > Why is this? Also, is there some other technique I can use to inspect > > all numpy generated objects? > > > > Two reasons. First of all, unless your array is an object arrays (or a > structured one with objects in it), there are no objects to track. The > array is a single python object without any referenced objects (except > possibly its `arr.base`). > > Second of all -- and this is an issue -- numpy doesn't actually > implement the traverse slot, so it won't even work for object arrays > (numpy object arrays do not support circular garbage collection at this > time, please feel free to implement it ;)). > > - Sebastian > > Does this answer why the ndarray object itself isn't tracked though? I must say I find this puzzling; the only thing I can think of is that the python compiler notices that data isn't used anymore after its creation, and deletes it right after its creation as an optimization, but that conflicts with my own experience of the GC. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Sep 15 06:10:59 2014 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 15 Sep 2014 11:10:59 +0100 Subject: [Numpy-discussion] Tracking and inspecting numpy objects In-Reply-To: References: <5416A05A.2000905@gmail.com> <1410774912.7696.13.camel@sebastian-t440> Message-ID: On Mon, Sep 15, 2014 at 11:05 AM, Eelco Hoogendoorn wrote: > Does this answer why the ndarray object itself isn't tracked though? I must > say I find this puzzling; the only thing I can think of is that the python > compiler notices that data isn't used anymore after its creation, and > deletes it right after its creation as an optimization, but that conflicts > with my own experience of the GC. The "gc" that "gc.is_tracked()" refers to is just the cyclical garbage detector, not the usual reference counting memory management that all Python objects participate in. If your object cannot participate in reference cycles (like most ndarrays), it doesn't need to be tracked. 
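A minimal illustration of the distinction:

import gc
import numpy

print gc.is_tracked([])                      # True: a list can participate in cycles
print gc.is_tracked(numpy.ones(3))           # False: a plain ndarray cannot
print gc.is_tracked(numpy.empty(3, object))  # also False today, only because ndarray
                                             # does not implement the traverse slot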
-- Robert Kern From sebastian at sipsolutions.net Mon Sep 15 06:11:55 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Mon, 15 Sep 2014 12:11:55 +0200 Subject: [Numpy-discussion] Tracking and inspecting numpy objects In-Reply-To: References: <5416A05A.2000905@gmail.com> <1410774912.7696.13.camel@sebastian-t440> Message-ID: <1410775915.8320.1.camel@sebastian-t440> On Mo, 2014-09-15 at 12:05 +0200, Eelco Hoogendoorn wrote: > > > > On Mon, Sep 15, 2014 at 11:55 AM, Sebastian Berg > wrote: > On Mo, 2014-09-15 at 10:16 +0200, Mads Ipsen wrote: > > Hi, > > > > I am trying to inspect the reference count of numpy arrays > generated by > > my application. > > > > Initially, I thought I could inspect the tracked objects > using > > gc.get_objects(), but, with respect to numpy objects, the > returned > > container is empty. For example: > > > > import numpy > > import gc > > > > data = numpy.ones(1024).reshape((32,32)) > > > > objs = [o for o in gc.get_objects() if isinstance(o, > numpy.ndarray)] > > > > print objs # Prints empty list > > print gc.is_tracked(data) # Print False > > > > Why is this? Also, is there some other technique I can use > to inspect > > all numpy generated objects? > > > > Two reasons. First of all, unless your array is an object > arrays (or a > structured one with objects in it), there are no objects to > track. The > array is a single python object without any referenced objects > (except > possibly its `arr.base`). > > Second of all -- and this is an issue -- numpy doesn't > actually > implement the traverse slot, so it won't even work for object > arrays > (numpy object arrays do not support circular garbage > collection at this > time, please feel free to implement it ;)). > > - Sebastian > > > > > > > Does this answer why the ndarray object itself isn't tracked though? I > must say I find this puzzling; the only thing I can think of is that > the python compiler notices that data isn't used anymore after its > creation, and deletes it right after its creation as an optimization, > but that conflicts with my own experience of the GC. > > Not sure if it does, but my quick try and error says: In [15]: class T(tuple): ....: pass ....: In [16]: t = T() In [17]: objs = [o for o in gc.get_objects() if isinstance(o, T)] In [18]: objs Out[18]: [()] In [19]: a = 123. In [20]: objs = [o for o in gc.get_objects() if isinstance(o, float)] In [21]: objs Out[21]: [] So I guess nothing is tracked, unless it contains things, and numpy arrays don't say they can contain things (i.e. no traverse). - Sebastian > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From mads.ipsen at gmail.com Mon Sep 15 07:06:15 2014 From: mads.ipsen at gmail.com (Mads Ipsen) Date: Mon, 15 Sep 2014 13:06:15 +0200 Subject: [Numpy-discussion] Tracking and inspecting numpy objects In-Reply-To: <1410775915.8320.1.camel@sebastian-t440> References: <5416A05A.2000905@gmail.com> <1410774912.7696.13.camel@sebastian-t440> <1410775915.8320.1.camel@sebastian-t440> Message-ID: <5416C827.4090605@gmail.com> Thanks to everybody for taking time to answer! 
Best regards, Mads On 15/09/14 12:11, Sebastian Berg wrote: > On Mo, 2014-09-15 at 12:05 +0200, Eelco Hoogendoorn wrote: >> >> >> >> On Mon, Sep 15, 2014 at 11:55 AM, Sebastian Berg >> wrote: >> On Mo, 2014-09-15 at 10:16 +0200, Mads Ipsen wrote: >> > Hi, >> > >> > I am trying to inspect the reference count of numpy arrays >> generated by >> > my application. >> > >> > Initially, I thought I could inspect the tracked objects >> using >> > gc.get_objects(), but, with respect to numpy objects, the >> returned >> > container is empty. For example: >> > >> > import numpy >> > import gc >> > >> > data = numpy.ones(1024).reshape((32,32)) >> > >> > objs = [o for o in gc.get_objects() if isinstance(o, >> numpy.ndarray)] >> > >> > print objs # Prints empty list >> > print gc.is_tracked(data) # Print False >> > >> > Why is this? Also, is there some other technique I can use >> to inspect >> > all numpy generated objects? >> > >> >> Two reasons. First of all, unless your array is an object >> arrays (or a >> structured one with objects in it), there are no objects to >> track. The >> array is a single python object without any referenced objects >> (except >> possibly its `arr.base`). >> >> Second of all -- and this is an issue -- numpy doesn't >> actually >> implement the traverse slot, so it won't even work for object >> arrays >> (numpy object arrays do not support circular garbage >> collection at this >> time, please feel free to implement it ;)). >> >> - Sebastian >> >> >> >> >> >> >> Does this answer why the ndarray object itself isn't tracked though? I >> must say I find this puzzling; the only thing I can think of is that >> the python compiler notices that data isn't used anymore after its >> creation, and deletes it right after its creation as an optimization, >> but that conflicts with my own experience of the GC. >> >> > > Not sure if it does, but my quick try and error says: > In [15]: class T(tuple): > ....: pass > ....: > > In [16]: t = T() > > In [17]: objs = [o for o in gc.get_objects() if isinstance(o, T)] > > In [18]: objs > Out[18]: [()] > > In [19]: a = 123. > > In [20]: objs = [o for o in gc.get_objects() if isinstance(o, float)] > > In [21]: objs > Out[21]: [] > > So I guess nothing is tracked, unless it contains things, and numpy > arrays don't say they can contain things (i.e. no traverse). > > - Sebastian > > > >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- +---------------------------------------------------------+ | Mads Ipsen | +----------------------+----------------------------------+ | G?seb?ksvej 7, 4. 
tv | phone: +45-29716388 | | DK-2500 Valby | email: mads.ipsen at gmail.com | | Denmark | map : www.tinyurl.com/ns52fpa | +----------------------+----------------------------------+ From josef.pktd at gmail.com Mon Sep 15 07:07:11 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 15 Sep 2014 07:07:11 -0400 Subject: [Numpy-discussion] Linear algebra functions on empty arrays In-Reply-To: <1410774528.7696.9.camel@sebastian-t440> References: <1410774528.7696.9.camel@sebastian-t440> Message-ID: On Mon, Sep 15, 2014 at 5:48 AM, Sebastian Berg wrote: > Hey all, > > for https://github.com/numpy/numpy/pull/3861/files I would like to allow > 0-sized dimensions for generalized ufuncs, meaning both that a gufunc has > to be able to handle this and that it *can* handle it at all. > However, lapack does not support this, so it needs some explicit fixing. > Also, some of the linalg functions currently explicitly allow empty arrays > and others explicitly disallow them. > > For example, QR and eigvals do not allow them, but on the other hand > solve explicitly does (though it most probably never worked, simply because > lapack does not support it). So I am wondering if there is some convention > for this, or what convention we should implement. What does an empty square matrix/array look like? np.linalg.solve can have an empty rhs, but what is the shape of an empty lhs, `a`? If I do a QR(arr) with arr.shape=(0, 5), what is R supposed to be ? I just wrote some loops over linalg.qr, but I always initialized explicitly. I didn't manage to figure out how empty arrays would be useful. If an empty square matrix can only be of shape (0, 0), then it's no use (in my applications). Josef > > Regards, > > Sebastian > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From sebastian at sipsolutions.net Mon Sep 15 07:26:05 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Mon, 15 Sep 2014 13:26:05 +0200 Subject: [Numpy-discussion] Linear algebra functions on empty arrays In-Reply-To: References: <1410774528.7696.9.camel@sebastian-t440> Message-ID: <1410780365.8320.9.camel@sebastian-t440> On Mo, 2014-09-15 at 07:07 -0400, josef.pktd at gmail.com wrote: > On Mon, Sep 15, 2014 at 5:48 AM, Sebastian Berg > wrote: > > Hey all, > > > > for https://github.com/numpy/numpy/pull/3861/files I would like to allow > > 0-sized dimensions for generalized ufuncs, meaning both that a gufunc has > > to be able to handle this and that it *can* handle it at all. > > However, lapack does not support this, so it needs some explicit fixing. > > Also, some of the linalg functions currently explicitly allow empty arrays > > and others explicitly disallow them. > > > > For example, QR and eigvals do not allow them, but on the other hand > > solve explicitly does (though it most probably never worked, simply because > > lapack does not support it). So I am wondering if there is some convention > > for this, or what convention we should implement. > > What does an empty square matrix/array look like? > > np.linalg.solve can have an empty rhs, but what is the shape of an empty lhs, `a`? > > If I do a QR(arr) with arr.shape=(0, 5), what is R supposed to be ? > QR may be more difficult, since R itself may not be empty, begging the question of whether you want to error out or fill it sensibly. Cholesky would require (0, 0), for example, and for eigenvalues it would somewhat make sense too: the (0, 0) matrix has 0 eigenvalues.
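To make the current inconsistency concrete (a rough sketch of the behavior as described above, not checked against every numpy version):

import numpy as np

a = np.empty((0, 0))
b = np.empty((0, 5))

np.linalg.solve(a, b)   # explicitly allowed by the checks
                        # (whether lapack copes is another question)
np.linalg.eigvals(a)    # explicitly rejected with a LinAlgError, though a
                        # length-0 result would be the natural answer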
I did not go through them all, but I would like to figure out whether we should aim to generally allow it, or maybe just allow it for some special ones. - Sebastian > > I just wrote some loops over linalg.qr, but I always initialized explicitly. > > I didn't manage to figure out how empty arrays would be useful. > > If an empty square matrix can only only be of shape (0, 0), then it's > no use (in my applications). > > > Josef > > > > > > Regards, > > > > Sebastian > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From hoogendoorn.eelco at gmail.com Mon Sep 15 07:31:31 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Mon, 15 Sep 2014 13:31:31 +0200 Subject: [Numpy-discussion] Tracking and inspecting numpy objects In-Reply-To: <5416C827.4090605@gmail.com> References: <5416A05A.2000905@gmail.com> <1410774912.7696.13.camel@sebastian-t440> <1410775915.8320.1.camel@sebastian-t440> <5416C827.4090605@gmail.com> Message-ID: I figured the reference to the object through the local scope would also be tracked by the GC somehow, since the entire stack frame can be regarded as a separate object itself, but apparently not. On Mon, Sep 15, 2014 at 1:06 PM, Mads Ipsen wrote: > Thanks to everybody for taking time to answer! > > Best regards, > > Mads > > On 15/09/14 12:11, Sebastian Berg wrote: > > On Mo, 2014-09-15 at 12:05 +0200, Eelco Hoogendoorn wrote: > >> > >> > >> > >> On Mon, Sep 15, 2014 at 11:55 AM, Sebastian Berg > >> wrote: > >> On Mo, 2014-09-15 at 10:16 +0200, Mads Ipsen wrote: > >> > Hi, > >> > > >> > I am trying to inspect the reference count of numpy arrays > >> generated by > >> > my application. > >> > > >> > Initially, I thought I could inspect the tracked objects > >> using > >> > gc.get_objects(), but, with respect to numpy objects, the > >> returned > >> > container is empty. For example: > >> > > >> > import numpy > >> > import gc > >> > > >> > data = numpy.ones(1024).reshape((32,32)) > >> > > >> > objs = [o for o in gc.get_objects() if isinstance(o, > >> numpy.ndarray)] > >> > > >> > print objs # Prints empty list > >> > print gc.is_tracked(data) # Print False > >> > > >> > Why is this? Also, is there some other technique I can use > >> to inspect > >> > all numpy generated objects? > >> > > >> > >> Two reasons. First of all, unless your array is an object > >> arrays (or a > >> structured one with objects in it), there are no objects to > >> track. The > >> array is a single python object without any referenced objects > >> (except > >> possibly its `arr.base`). > >> > >> Second of all -- and this is an issue -- numpy doesn't > >> actually > >> implement the traverse slot, so it won't even work for object > >> arrays > >> (numpy object arrays do not support circular garbage > >> collection at this > >> time, please feel free to implement it ;)). > >> > >> - Sebastian > >> > >> > >> > >> > >> > >> > >> Does this answer why the ndarray object itself isn't tracked though? 
I > >> must say I find this puzzling; the only thing I can think of is that > >> the python compiler notices that data isn't used anymore after its > >> creation, and deletes it right after its creation as an optimization, > >> but that conflicts with my own experience of the GC. > >> > >> > > > > Not sure if it does, but my quick try and error says: > > In [15]: class T(tuple): > > ....: pass > > ....: > > > > In [16]: t = T() > > > > In [17]: objs = [o for o in gc.get_objects() if isinstance(o, T)] > > > > In [18]: objs > > Out[18]: [()] > > > > In [19]: a = 123. > > > > In [20]: objs = [o for o in gc.get_objects() if isinstance(o, float)] > > > > In [21]: objs > > Out[21]: [] > > > > So I guess nothing is tracked, unless it contains things, and numpy > > arrays don't say they can contain things (i.e. no traverse). > > > > - Sebastian > > > > > > > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > -- > +---------------------------------------------------------+ > | Mads Ipsen | > +----------------------+----------------------------------+ > | G?seb?ksvej 7, 4. tv | phone: +45-29716388 | > | DK-2500 Valby | email: mads.ipsen at gmail.com | > | Denmark | map : www.tinyurl.com/ns52fpa | > +----------------------+----------------------------------+ > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Sep 15 08:07:16 2014 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 15 Sep 2014 13:07:16 +0100 Subject: [Numpy-discussion] Tracking and inspecting numpy objects In-Reply-To: References: <5416A05A.2000905@gmail.com> <1410774912.7696.13.camel@sebastian-t440> <1410775915.8320.1.camel@sebastian-t440> <5416C827.4090605@gmail.com> Message-ID: On Mon, Sep 15, 2014 at 12:31 PM, Eelco Hoogendoorn wrote: > I figured the reference to the object through the local scope would also be > tracked by the GC somehow, since the entire stack frame can be regarded as a > separate object itself, but apparently not. Objects are "tracked", not references. An object is only considered "tracked" by the cyclic GC if the cyclic GC needs to traverse the object's referents when looking for cycles. It is not necessarily "tracked" just because it is referred to by an object that is "tracked". If an object does not refer to other objects, *OR* if such references cannot create a cycle for whatever reason, then the object does not need to be tracked by the GC. Most ndarrays fall in that second category. They do have references to their base ndarray, if a view, and their dtype object, but under most circumstances these cannot cause a cycle. As Sebastian mentioned, though, dtype=object arrays can create cycles and should be tracked. We need to implement the "tp_traverse" slot in order to do so. 
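For the record, a small sketch of the consequence (a self-referencing object array is simply invisible to the collector):

import gc
import numpy as np

a = np.empty(1, dtype=object)
a[0] = a                # a reference cycle through an object array
print gc.is_tracked(a)  # False: without tp_traverse the GC cannot see into it
del a
gc.collect()            # the cycle is unreachable now, but will never be freed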
-- Robert Kern From josef.pktd at gmail.com Mon Sep 15 08:08:08 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 15 Sep 2014 08:08:08 -0400 Subject: [Numpy-discussion] Linear algebra functions on empty arrays In-Reply-To: <1410780365.8320.9.camel@sebastian-t440> References: <1410774528.7696.9.camel@sebastian-t440> <1410780365.8320.9.camel@sebastian-t440> Message-ID: On Mon, Sep 15, 2014 at 7:26 AM, Sebastian Berg wrote: > On Mo, 2014-09-15 at 07:07 -0400, josef.pktd at gmail.com wrote: >> On Mon, Sep 15, 2014 at 5:48 AM, Sebastian Berg >> wrote: >> > Hey all, >> > >> > for https://github.com/numpy/numpy/pull/3861/files I would like to allow >> > 0-sized dimensions for generalized ufuncs, meaning that the gufunc has >> > to be able to handle this, but also that it *can* handle it at all. >> > However lapack does not support this, so it needs some explicit fixing. >> > Also some of the linalg functions currently explicitly allow and others >> > explicitly disallow empty arrays. >> > >> > For example the QR and eigvals does not allow it, but on the other hand >> > solve explicitly does (most probably never did, simply because lapack >> > does not). So I am wondering if there is some convention for this, or >> > what convention we should implement. >> >> What does an empty square matrix/array look like? >> >> np.linalg.solve can have empty rhs, but shape of empty lhs, `a`, is ? >> >> If I do a QR(arr) with arr.shape=(0, 5), what is R supposed to be ? >> > > QR may be more difficult since R may itself could not be empty, begging > the question if you want to error out or fill it sensibly. I shouldn't have tried it again (I got this a few times last week): >>> ze = np.ones((z.shape[1], 0)) >>> np.linalg.qr(ze) ** On entry to DGEQRF parameter number 7 had an illegal value crash z.shape[1] is 3 >>> np.__version__ '1.6.1' I think, I would prefer an exception if the output would require a empty square matrix with shape > (0, 0) I don't see any useful fill value. > Cholesky would require (0, 0) for example and for eigenvalues it would > somewhat make sense too, the (0, 0) matrix has 0 eigenvalues. > I did not go through them all, but I would like to figure out whether we > should aim to generally allow it, or maybe just allow it for some > special ones. If the return square array has shape (0, 0), then it would make sense, but I haven't run into a case for it yet. np.cholesky(np.ones((0, 0))) ? (I didn't try since my interpreter is crashed. :) Josef > > - Sebastian > >> >> I just wrote some loops over linalg.qr, but I always initialized explicitly. >> >> I didn't manage to figure out how empty arrays would be useful. >> >> If an empty square matrix can only only be of shape (0, 0), then it's >> no use (in my applications). 
>> >> >> Josef >> >> >> > >> > Regards, >> > >> > Sebastian >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From ben.root at ou.edu Mon Sep 15 09:50:51 2014 From: ben.root at ou.edu (Benjamin Root) Date: Mon, 15 Sep 2014 09:50:51 -0400 Subject: [Numpy-discussion] Question about broadcasting vs for loop performance In-Reply-To: References: Message-ID: Broadcasting, by itself, should not be creating large arrays in memory. It uses stride tricks to make the array appear larger, while simply reusing the same memory block. This is why it is so valuable because it doesn't make a copy. Now, what may be happening is that the resulting calculation from the broadcasted arrays is too large to easily fit into the cpu cache, so the subsequent summation might be hitting performance penalties for that. Essentially, your first example may be a poor-man's implementation of data chunking. I bet if you ran these performance metrics over a wide range of sizes, you will see some interesting results. Cheers! Ben Root On Sun, Sep 14, 2014 at 10:53 PM, Ryan Nelson wrote: > I think I figured out my own question. I guess that the broadcasting > approach is generating a very large 2D array in memory, which takes a bit > of extra time. I gathered this from reading the last example on the > following site: > http://wiki.scipy.org/EricsBroadcastingDoc > I tried this again with a much smaller "xs" array (~100 points) and the > broadcasting version was much faster. > Thanks > > Ryan > > Note: The link to the Scipy wiki page above is broken at the bottom of > Numpy's broadcasting page, otherwise I would have seen that earlier. Sorry > for the noise. > > On Sun, Sep 14, 2014 at 10:22 PM, Ryan Nelson > wrote: > >> Hello all, >> >> I have a question about the performance of broadcasting versus Python for >> loops. I have the following sample code that approximates some simulation >> I'd like to do: >> >> ## Test Code ## >> >> import numpy as np >> >> >> def lorentz(x, pos, inten, hwhm): >> >> return inten*( hwhm**2 / ( (x - pos)**2 + hwhm**2 ) ) >> >> >> poss = np.random.rand(100) >> >> intens = np.random.rand(100) >> >> xs = np.linspace(0,10,10000) >> >> >> def first_try(): >> >> sim_inten = np.zeros(xs.shape) >> >> for freq, inten in zip(poss, intens): >> >> sim_inten += lorentz(xs, freq, inten, 5.0) >> >> return sim_inten >> >> >> def second_try(): >> >> sim_inten2 = lorentz(xs.reshape((-1,1)), poss, intens, 5.0) >> >> sim_inten2 = sim_inten2.sum(axis=1) >> >> return sim_inten2 >> >> >> print np.array_equal(first_try(), second_try()) >> >> >> ## End Test ## >> >> >> Running this script prints "True" for the final equality test. However, >> IPython's %timeit magic, gives ~10 ms for first_try and ~30 ms for >> second_try. I tried this on Windows 7 (Anaconda Python) and on a Linux >> machine both with Python 2.7 and Numpy 1.8.2. >> >> >> I understand in principle why broadcasting should be faster than Python >> loops, but I'm wondering why I'm getting worse results with the pure Numpy >> function. 
Is there some general rules for when broadcasting might give >> worse performance than a Python loop? >> >> >> Thanks >> >> >> Ryan >> >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Sep 16 15:27:32 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 16 Sep 2014 13:27:32 -0600 Subject: [Numpy-discussion] Is this a bug? Message-ID: Hi All, It turns out that gufuncs will broadcast the last dimension if it is one. For instance, inner1d has signature `(n), (n) -> ()`, yet In [27]: inner1d([1,1,1], [1]) Out[27]: 3 In [28]: inner1d([1,1,1], [1,1]) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () ----> 1 inner1d([1,1,1], [1,1]) ValueError: inner1d: Operand 1 has a mismatch in its core dimension 0, with gufunc signature (i),(i)->() (size 2 is different from 3) I'd think this is a bug, as the dimensions should match. Note that scalar 1 will be promoted to [1] in this case. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Sep 16 15:39:01 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 16 Sep 2014 13:39:01 -0600 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 1:27 PM, Charles R Harris wrote: > Hi All, > > It turns out that gufuncs will broadcast the last dimension if it is one. > For instance, inner1d has signature `(n), (n) -> ()`, yet > > In [27]: inner1d([1,1,1], [1]) > Out[27]: 3 > > In [28]: inner1d([1,1,1], [1,1]) > --------------------------------------------------------------------------- > ValueError Traceback (most recent call last) > in () > ----> 1 inner1d([1,1,1], [1,1]) > > ValueError: inner1d: Operand 1 has a mismatch in its core dimension 0, > with gufunc signature (i),(i)->() (size 2 is different from 3) > > > I'd think this is a bug, as the dimensions should match. Note that scalar > 1 will be promoted to [1] in this case. > > Thoughts? > > This also holds for matrix_multiply In [33]: matrix_multiply(eye(3), [[1]]) Out[33]: array([[ 1.], [ 1.], [ 1.]]) Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Sep 16 15:42:43 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 16 Sep 2014 15:42:43 -0400 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 3:27 PM, Charles R Harris wrote: > Hi All, > > It turns out that gufuncs will broadcast the last dimension if it is one. > For instance, inner1d has signature `(n), (n) -> ()`, yet > > In [27]: inner1d([1,1,1], [1]) > Out[27]: 3 Yes, this looks totally wrong to me too... broadcasting is a feature of auto-vectorizing a core operation over a set of dimensions, it shouldn't be applied to the dimensions of the core operation itself like this. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From josef.pktd at gmail.com Tue Sep 16 15:55:41 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 16 Sep 2014 15:55:41 -0400 Subject: [Numpy-discussion] Is this a bug? 
In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 3:42 PM, Nathaniel Smith wrote: > On Tue, Sep 16, 2014 at 3:27 PM, Charles R Harris > wrote: >> Hi All, >> >> It turns out that gufuncs will broadcast the last dimension if it is one. >> For instance, inner1d has signature `(n), (n) -> ()`, yet >> >> In [27]: inner1d([1,1,1], [1]) >> Out[27]: 3 > > Yes, this looks totally wrong to me too... broadcasting is a feature > of auto-vectorizing a core operation over a set of dimensions, it > shouldn't be applied to the dimensions of the core operation itself > like this. Are these functions doing any numerical shortcuts in this case? If yes, this would be convenient. inner1d(x, weights) with weights is either (n, ) or () if weights == 1: return x.sum() else: return inner1d(x, weights) Josef > > -n > > -- > Nathaniel J. Smith > Postdoctoral researcher - Informatics - University of Edinburgh > http://vorpus.org > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From njs at pobox.com Tue Sep 16 16:04:08 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 16 Sep 2014 16:04:08 -0400 Subject: [Numpy-discussion] Linear algebra functions on empty arrays In-Reply-To: <1410774528.7696.9.camel@sebastian-t440> References: <1410774528.7696.9.camel@sebastian-t440> Message-ID: On 15 Sep 2014 05:49, "Sebastian Berg" wrote: > For example the QR and eigvals does not allow it, but on the other hand > solve explicitly does (most probably never did, simply because lapack > does not). So I am wondering if there is some convention for this, or > what convention we should implement. To me the obvious convention would be that whenever there's a unique obvious answer that satisfies the operation's invariants, then we should prefer to implement it (though possibly with low priority), even if this means papering over lapack edge cases. This is consistent with how e.g. we already define sum([]) and prod([]) and empty matrix products, etc. Of course this requires some thinking... e.g. the empty matrix is a null matrix, b/c given empty_vec = np.ones((0,)) empty_mat = np.ones((0, 0)) then we have empty_vec @ empty_mat @ empty_vec = empty_vec @ empty_vec = sum([]) = 0 and therefore empty_mat is not positive definite. np.linalg.cholesky raises an error on non-positive-definite matrices in general (e.g. try np.linalg.cholesky(np.zeros((1, 1)))), so I guess cholesky shouldn't handle empty matrices. For eigvals, I guess empty_mat @ empty_vec = empty_vec, meaning that empty_vec is a arguably an eigenvector with some indeterminate eigenvalue? Or maybe the fact that scalar * empty_vec = empty_vec for ever scalar means that empty_vec should be counted as a zero vector, and thus be ineligible to be an eigenvector. Saying that the empty matrix has zero eigenvectors or eigenvalues seems pretty intuitive. I don't see any trouble with defining qr for empty matrices either. -n From charlesr.harris at gmail.com Tue Sep 16 16:10:33 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 16 Sep 2014 14:10:33 -0600 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 1:55 PM, wrote: > On Tue, Sep 16, 2014 at 3:42 PM, Nathaniel Smith wrote: > > On Tue, Sep 16, 2014 at 3:27 PM, Charles R Harris > > wrote: > >> Hi All, > >> > >> It turns out that gufuncs will broadcast the last dimension if it is > one. 
> >> For instance, inner1d has signature `(n), (n) -> ()`, yet > >> > >> In [27]: inner1d([1,1,1], [1]) > >> Out[27]: 3 > > > > Yes, this looks totally wrong to me too... broadcasting is a feature > > of auto-vectorizing a core operation over a set of dimensions, it > > shouldn't be applied to the dimensions of the core operation itself > > like this. > > Are these functions doing any numerical shortcuts in this case? > > If yes, this would be convenient. > > inner1d(x, weights) with weights is either (n, ) or () > > if weights == 1: > return x.sum() > else: > return inner1d(x, weights) > > That depends on the inner inner loop ;) Currently inner1d inner loop multiplies and adds so not as efficient as a sum in the scalar case. However, it is probably faster than an if statement. In [4]: timeit inner1d(a, 1) 10000 loops, best of 3: 56.4 ?s per loop In [5]: timeit a.sum() 10000 loops, best of 3: 48.3 ?s per loop Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Sep 16 16:26:41 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 16 Sep 2014 16:26:41 -0400 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 3:55 PM, wrote: > On Tue, Sep 16, 2014 at 3:42 PM, Nathaniel Smith wrote: >> On Tue, Sep 16, 2014 at 3:27 PM, Charles R Harris >> wrote: >>> Hi All, >>> >>> It turns out that gufuncs will broadcast the last dimension if it is one. >>> For instance, inner1d has signature `(n), (n) -> ()`, yet >>> >>> In [27]: inner1d([1,1,1], [1]) >>> Out[27]: 3 >> >> Yes, this looks totally wrong to me too... broadcasting is a feature >> of auto-vectorizing a core operation over a set of dimensions, it >> shouldn't be applied to the dimensions of the core operation itself >> like this. > > Are these functions doing any numerical shortcuts in this case? > > If yes, this would be convenient. > > inner1d(x, weights) with weights is either (n, ) or () > > if weights == 1: > return x.sum() > else: > return inner1d(x, weights) Yes, if this is the behaviour you want then I think you should write this if statement :-). This case isn't general enough to build directly into inner1d IMHO. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From jaime.frio at gmail.com Tue Sep 16 16:31:59 2014 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Tue, 16 Sep 2014 13:31:59 -0700 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 12:27 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > Hi All, > > It turns out that gufuncs will broadcast the last dimension if it is one. > For instance, inner1d has signature `(n), (n) -> ()`, yet > > In [27]: inner1d([1,1,1], [1]) > Out[27]: 3 > > In [28]: inner1d([1,1,1], [1,1]) > --------------------------------------------------------------------------- > ValueError Traceback (most recent call last) > in () > ----> 1 inner1d([1,1,1], [1,1]) > > ValueError: inner1d: Operand 1 has a mismatch in its core dimension 0, > with gufunc signature (i),(i)->() (size 2 is different from 3) > > > I'd think this is a bug, as the dimensions should match. Note that scalar > 1 will be promoted to [1] in this case. > > Thoughts? 
> If it is a bug, it is an extended one, because it is the same behavior of einsum: >>> np.einsum('i,i', [1,1,1], [1]) 3 >>> np.einsum('i,i', [1,1,1], [1,1]) Traceback (most recent call last): File "", line 1, in ValueError: operands could not be broadcast together with remapped shapes [origi nal->remapped]: (3,)->(3,) (2,)->(2,) And I think it is a conscious design decision, there is a comment about broadcasting missing core dimensions here: https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1940 and the code makes it very explicit that input argument dimensions with the same label are broadcast to a common shape, see here: https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1956 I kind of expect numpy to broadcast whenever possible, so this doesn't feel wrong to me. That said, it is hard to come up with convincing examples of how this behavior would be useful in any practical context. But changing something that has been working like that for so long seems like a risky thing. And I cannot come with a convincing example of why it would be harmful either. Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Sep 16 16:51:35 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 16 Sep 2014 16:51:35 -0400 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 4:31 PM, Jaime Fern?ndez del R?o wrote: > If it is a bug, it is an extended one, because it is the same behavior of > einsum: > >>>> np.einsum('i,i', [1,1,1], [1]) > 3 >>>> np.einsum('i,i', [1,1,1], [1,1]) > Traceback (most recent call last): > File "", line 1, in > ValueError: operands could not be broadcast together with remapped shapes > [origi > nal->remapped]: (3,)->(3,) (2,)->(2,) > > And I think it is a conscious design decision, there is a comment about > broadcasting missing core dimensions here: > > https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1940 "intentional" and "sensible" are not always the same thing :-). That said, it isn't totally obvious to me what the correct behaviour for einsum is in this case. > and the code makes it very explicit that input argument dimensions with the > same label are broadcast to a common shape, see here: > > https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1956 > > I kind of expect numpy to broadcast whenever possible, so this doesn't feel > wrong to me. The case Chuck is talking about is like if we allowed matrix multiplication between an array with shape (n, 1) with an array with (k, m), because (n, 1) can be broadcast to (n, k). This feels VERY wrong to me, will certainly hide many bugs, and is definitely not how it works right now (for np.dot, anyway; apparently it does work that way for the brand-new gufunc np.linalg.matrix_multiply, but this must be an accident). > That said, it is hard to come up with convincing examples of how this > behavior would be useful in any practical context. But changing something > that has been working like that for so long seems like a risky thing. And I > cannot come with a convincing example of why it would be harmful either. gufuncs are very new. -n -- Nathaniel J. 
Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From charlesr.harris at gmail.com Tue Sep 16 18:26:23 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 16 Sep 2014 16:26:23 -0600 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 2:51 PM, Nathaniel Smith wrote: > On Tue, Sep 16, 2014 at 4:31 PM, Jaime Fern?ndez del R?o > wrote: > > If it is a bug, it is an extended one, because it is the same behavior of > > einsum: > > > >>>> np.einsum('i,i', [1,1,1], [1]) > > 3 > >>>> np.einsum('i,i', [1,1,1], [1,1]) > > Traceback (most recent call last): > > File "", line 1, in > > ValueError: operands could not be broadcast together with remapped shapes > > [origi > > nal->remapped]: (3,)->(3,) (2,)->(2,) > > > > And I think it is a conscious design decision, there is a comment about > > broadcasting missing core dimensions here: > > > > > https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1940 > > "intentional" and "sensible" are not always the same thing :-). That > said, it isn't totally obvious to me what the correct behaviour for > einsum is in this case. > > > and the code makes it very explicit that input argument dimensions with > the > > same label are broadcast to a common shape, see here: > > > > > https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1956 > > > > I kind of expect numpy to broadcast whenever possible, so this doesn't > feel > > wrong to me. > > The case Chuck is talking about is like if we allowed matrix > multiplication between an array with shape (n, 1) with an array with > (k, m), because (n, 1) can be broadcast to (n, k). This feels VERY > wrong to me, will certainly hide many bugs, and is definitely not how > it works right now (for np.dot, anyway; apparently it does work that > way for the brand-new gufunc np.linalg.matrix_multiply, but this must > be an accident). > > > That said, it is hard to come up with convincing examples of how this > > behavior would be useful in any practical context. But changing something > > that has been working like that for so long seems like a risky thing. > And I > > cannot come with a convincing example of why it would be harmful either. > > gufuncs are very new. > > Or at least newly used. They've been sitting around for years with little use and less testing. This is probably (easily?) fixable as the shape of the operands is available. In [22]: [d.shape for d in nditer([[1,1,1], [[1,1,1]]*3]).operands] Out[22]: [(3,), (3, 3)] In [23]: [d.shape for d in nditer([[[1,1,1]], [[1,1,1]]*3]).operands] Out[23]: [(1, 3), (3, 3)] Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaime.frio at gmail.com Tue Sep 16 18:56:47 2014 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Tue, 16 Sep 2014 15:56:47 -0700 Subject: [Numpy-discussion] Is this a bug? 
In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 3:26 PM, Charles R Harris wrote: > > > On Tue, Sep 16, 2014 at 2:51 PM, Nathaniel Smith wrote: > >> On Tue, Sep 16, 2014 at 4:31 PM, Jaime Fern?ndez del R?o >> wrote: >> > If it is a bug, it is an extended one, because it is the same behavior >> of >> > einsum: >> > >> >>>> np.einsum('i,i', [1,1,1], [1]) >> > 3 >> >>>> np.einsum('i,i', [1,1,1], [1,1]) >> > Traceback (most recent call last): >> > File "", line 1, in >> > ValueError: operands could not be broadcast together with remapped >> shapes >> > [origi >> > nal->remapped]: (3,)->(3,) (2,)->(2,) >> > >> > And I think it is a conscious design decision, there is a comment about >> > broadcasting missing core dimensions here: >> > >> > >> https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1940 >> >> "intentional" and "sensible" are not always the same thing :-). That >> said, it isn't totally obvious to me what the correct behaviour for >> einsum is in this case. >> >> > and the code makes it very explicit that input argument dimensions with >> the >> > same label are broadcast to a common shape, see here: >> > >> > >> https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1956 >> > >> > I kind of expect numpy to broadcast whenever possible, so this doesn't >> feel >> > wrong to me. >> >> The case Chuck is talking about is like if we allowed matrix >> multiplication between an array with shape (n, 1) with an array with >> (k, m), because (n, 1) can be broadcast to (n, k). This feels VERY >> wrong to me, will certainly hide many bugs, and is definitely not how >> it works right now (for np.dot, anyway; apparently it does work that >> way for the brand-new gufunc np.linalg.matrix_multiply, but this must >> be an accident). >> >> > That said, it is hard to come up with convincing examples of how this >> > behavior would be useful in any practical context. But changing >> something >> > that has been working like that for so long seems like a risky thing. >> And I >> > cannot come with a convincing example of why it would be harmful either. >> >> gufuncs are very new. >> >> > Or at least newly used. They've been sitting around for years with little > use and less testing. This is probably (easily?) fixable as the shape of > the operands is available. > > In [22]: [d.shape for d in nditer([[1,1,1], [[1,1,1]]*3]).operands] > Out[22]: [(3,), (3, 3)] > > In [23]: [d.shape for d in nditer([[[1,1,1]], [[1,1,1]]*3]).operands] > Out[23]: [(1, 3), (3, 3)] > > If we agree that it is broken, which I still am not fully sure of, then yes, it is very easy to fix. I have been looking into that code quite a bit lately, so I could patch something up pretty quick. Are we OK with the appending of size 1 dimensions to complete the core dimensions? That is, should matrix_multiply([1,1,1], [[1],[1],[1]]) work, or should it complain about the first argument having less dimensions than the core dimensions in the signature? Lastly, there is an interesting side effect of the way this broadcasting is handled: if a gufunc specifies a core dimension in an output argument only, and an `out` kwarg is not passed in, then the output array will have that core dimension set to be of size 1, e.g. if the signature of `f` is '(),()->(a)', then f(1, 2).shape is (1,). 
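In python terms, the resolution amounts to something like this (a hand-wavy sketch of what the C code does, not the actual implementation):

def resolve_core_dim(sizes, label, op_dim):
    # `sizes` maps signature labels to the sizes resolved so far
    bound = sizes.get(label)
    if bound is None or bound == 1:
        sizes[label] = op_dim    # first occurrence binds the label
    elif op_dim != 1 and op_dim != bound:
        raise ValueError("core dimension mismatch")
    # an op_dim of 1 is silently broadcast up to the bound size, and an
    # output-only label that never gets bound defaults to size 1, which
    # is where the f(1, 2).shape == (1,) above comes from

sizes = {}
resolve_core_dim(sizes, 'i', 3)  # inner1d's first operand binds i=3
resolve_core_dim(sizes, 'i', 1)  # a second operand of size 1 slips through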
This has always felt funny to me, and I think that an unspecified dimension in an output array should either be specified by a passed out array, or raise an error about an unspecified core dimension or something like that. Does this sound right? Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewm at redtetrahedron.org Tue Sep 16 19:03:27 2014 From: ewm at redtetrahedron.org (Eric Moore) Date: Tue, 16 Sep 2014 19:03:27 -0400 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tuesday, September 16, 2014, Jaime Fern?ndez del R?o < jaime.frio at gmail.com> wrote: > On Tue, Sep 16, 2014 at 3:26 PM, Charles R Harris < > charlesr.harris at gmail.com > > wrote: > >> >> >> On Tue, Sep 16, 2014 at 2:51 PM, Nathaniel Smith > > wrote: >> >>> On Tue, Sep 16, 2014 at 4:31 PM, Jaime Fern?ndez del R?o >>> >> > wrote: >>> > If it is a bug, it is an extended one, because it is the same behavior >>> of >>> > einsum: >>> > >>> >>>> np.einsum('i,i', [1,1,1], [1]) >>> > 3 >>> >>>> np.einsum('i,i', [1,1,1], [1,1]) >>> > Traceback (most recent call last): >>> > File "", line 1, in >>> > ValueError: operands could not be broadcast together with remapped >>> shapes >>> > [origi >>> > nal->remapped]: (3,)->(3,) (2,)->(2,) >>> > >>> > And I think it is a conscious design decision, there is a comment about >>> > broadcasting missing core dimensions here: >>> > >>> > >>> https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1940 >>> >>> "intentional" and "sensible" are not always the same thing :-). That >>> said, it isn't totally obvious to me what the correct behaviour for >>> einsum is in this case. >>> >>> > and the code makes it very explicit that input argument dimensions >>> with the >>> > same label are broadcast to a common shape, see here: >>> > >>> > >>> https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1956 >>> > >>> > I kind of expect numpy to broadcast whenever possible, so this doesn't >>> feel >>> > wrong to me. >>> >>> The case Chuck is talking about is like if we allowed matrix >>> multiplication between an array with shape (n, 1) with an array with >>> (k, m), because (n, 1) can be broadcast to (n, k). This feels VERY >>> wrong to me, will certainly hide many bugs, and is definitely not how >>> it works right now (for np.dot, anyway; apparently it does work that >>> way for the brand-new gufunc np.linalg.matrix_multiply, but this must >>> be an accident). >>> >>> > That said, it is hard to come up with convincing examples of how this >>> > behavior would be useful in any practical context. But changing >>> something >>> > that has been working like that for so long seems like a risky thing. >>> And I >>> > cannot come with a convincing example of why it would be harmful >>> either. >>> >>> gufuncs are very new. >>> >>> >> Or at least newly used. They've been sitting around for years with little >> use and less testing. This is probably (easily?) fixable as the shape of >> the operands is available. >> >> In [22]: [d.shape for d in nditer([[1,1,1], [[1,1,1]]*3]).operands] >> Out[22]: [(3,), (3, 3)] >> >> In [23]: [d.shape for d in nditer([[[1,1,1]], [[1,1,1]]*3]).operands] >> Out[23]: [(1, 3), (3, 3)] >> >> > If we agree that it is broken, which I still am not fully sure of, then > yes, it is very easy to fix. 
I have been looking into that code quite a bit > lately, so I could patch something up pretty quick. > > Are we OK with the appending of size 1 dimensions to complete the core > dimensions? That is, should matrix_multiply([1,1,1], [[1],[1],[1]]) work, > or should it complain about the first argument having less dimensions than > the core dimensions in the signature? > > Lastly, there is an interesting side effect of the way this broadcasting > is handled: if a gufunc specifies a core dimension in an output argument > only, and an `out` kwarg is not passed in, then the output array will have > that core dimension set to be of size 1, e.g. if the signature of `f` is > '(),()->(a)', then f(1, 2).shape is (1,). This has always felt funny to me, > and I think that an unspecified dimension in an output array should either > be specified by a passed out array, or raise an error about an unspecified > core dimension or something like that. Does this sound right? > > Jaime > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes > de dominaci?n mundial. > Given this and the earlier discussion about improvements to this code, I wonder if it wouldn't be worth implemented the logic in python first. This way there is something to test against, and something to play while all of the cases are sorted out. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Sep 16 19:07:02 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 16 Sep 2014 17:07:02 -0600 Subject: [Numpy-discussion] Is this a bug? In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 4:56 PM, Jaime Fern?ndez del R?o < jaime.frio at gmail.com> wrote: > On Tue, Sep 16, 2014 at 3:26 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Tue, Sep 16, 2014 at 2:51 PM, Nathaniel Smith wrote: >> >>> On Tue, Sep 16, 2014 at 4:31 PM, Jaime Fern?ndez del R?o >>> wrote: >>> > If it is a bug, it is an extended one, because it is the same behavior >>> of >>> > einsum: >>> > >>> >>>> np.einsum('i,i', [1,1,1], [1]) >>> > 3 >>> >>>> np.einsum('i,i', [1,1,1], [1,1]) >>> > Traceback (most recent call last): >>> > File "", line 1, in >>> > ValueError: operands could not be broadcast together with remapped >>> shapes >>> > [origi >>> > nal->remapped]: (3,)->(3,) (2,)->(2,) >>> > >>> > And I think it is a conscious design decision, there is a comment about >>> > broadcasting missing core dimensions here: >>> > >>> > >>> https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1940 >>> >>> "intentional" and "sensible" are not always the same thing :-). That >>> said, it isn't totally obvious to me what the correct behaviour for >>> einsum is in this case. >>> >>> > and the code makes it very explicit that input argument dimensions >>> with the >>> > same label are broadcast to a common shape, see here: >>> > >>> > >>> https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L1956 >>> > >>> > I kind of expect numpy to broadcast whenever possible, so this doesn't >>> feel >>> > wrong to me. >>> >>> The case Chuck is talking about is like if we allowed matrix >>> multiplication between an array with shape (n, 1) with an array with >>> (k, m), because (n, 1) can be broadcast to (n, k). 
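Eric's prototype idea is attractive because the core-dimension matching rules are easy to state in pure Python. A toy sketch of what such a prototype could look like (match_signature is a hypothetical helper; strict matching only, with broadcasting of the loop dimensions deliberately left out):

    def match_signature(sig, shapes):
        """Map core-dimension labels in `sig`, e.g. '(i),(i)->()', to sizes,
        raising on mismatch instead of silently broadcasting."""
        ins, _, out = sig.partition('->')
        cores = [c.strip('()').split(',') if c.strip('()') else []
                 for c in ins.split('),(')]
        sizes = {}
        for core, shape in zip(cores, shapes):
            if len(core) > len(shape):
                raise ValueError("operand has fewer dimensions than its core")
            # core dimensions are the trailing axes of each operand
            for label, n in zip(core, shape[len(shape) - len(core):]):
                if sizes.setdefault(label, n) != n:
                    raise ValueError("core dimension %r mismatch" % label)
        return sizes

    >>> match_signature('(i),(i)->()', [(2, 3), (3,)])   # loop dim 2, core i = 3
    {'i': 3}
    >>> match_signature('(i),(i)->()', [(3,), (1,)])     # no core broadcasting
    Traceback (most recent call last):
      ...
    ValueError: core dimension 'i' mismatch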
From charlesr.harris at gmail.com Tue Sep 16 19:07:02 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Tue, 16 Sep 2014 17:07:02 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID:

On Tue, Sep 16, 2014 at 4:56 PM, Jaime Fernández del Río <jaime.frio at gmail.com> wrote:

> [...]
>
> If we agree that it is broken, which I still am not fully sure of, then
> yes, it is very easy to fix. I have been looking into that code quite a bit
> lately, so I could patch something up pretty quick.

That would be nice... I've been starting to look through the code and didn't relish it.

> Are we OK with the appending of size 1 dimensions to complete the core
> dimensions? That is, should matrix_multiply([1,1,1], [[1],[1],[1]]) work,
> or should it complain about the first argument having fewer dimensions than
> the core dimensions in the signature?

Yes, I think we need to keep that part. It is even essential ;)

> Lastly, there is an interesting side effect of the way this broadcasting
> is handled: if a gufunc specifies a core dimension in an output argument
> only, and an `out` kwarg is not passed in, then the output array will have
> that core dimension set to be of size 1 [...] Does this sound right?

Uh, I need to get my head around that before commenting.

Chuck

From charlesr.harris at gmail.com Tue Sep 16 19:11:07 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Tue, 16 Sep 2014 17:11:07 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID:

On Tue, Sep 16, 2014 at 5:03 PM, Eric Moore wrote:

> [...]
>
> Given this and the earlier discussion about improvements to this code, I
> wonder if it wouldn't be worth implementing the logic in python first. This
> way there is something to test against, and something to play with while
> all of the cases are sorted out.

I've got a couple of generalized functions whose tests turned this up. Speaking of which, they are tentatively named

mulvecvec
mulvecmat
mulmatvec
mulmatmat

and work on stacked matrices and vectors. I can see using 'dot' instead of 'mul', and any other suggestions would be welcome. I've also made it easier to specify generalized functions in the code generator, but given the multiple loops, I haven't settled on a good way of using generic loops.

Chuck
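For reference, Chuck's four tentative loops correspond to einsum calls like the following on stacked operands (einsum is used here purely to illustrate the intended signatures; the mulvec*/mulmat* gufuncs are his work-in-progress names and exist in no released numpy):

    import numpy as np

    a = np.ones((10, 3))       # a stack of ten 3-vectors
    B = np.ones((10, 3, 4))    # a stack of ten 3x4 matrices
    C = np.ones((10, 4, 3))    # a stack of ten 4x3 matrices

    np.einsum('...i,...i->...', a, a)        # mulvecvec, signature (i),(i)->()
    np.einsum('...i,...ij->...j', a, B)      # mulvecmat, signature (i),(i,j)->(j)
    np.einsum('...ij,...j->...i', C, a)      # mulmatvec, signature (i,j),(j)->(i)
    np.einsum('...ij,...jk->...ik', B, C)    # mulmatmat, signature (i,j),(j,k)->(i,k)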
From njs at pobox.com Tue Sep 16 19:32:32 2014
From: njs at pobox.com (Nathaniel Smith)
Date: Tue, 16 Sep 2014 19:32:32 -0400
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID:

On Tue, Sep 16, 2014 at 6:56 PM, Jaime Fernández del Río wrote:

> [...]
>
> Are we OK with the appending of size 1 dimensions to complete the core
> dimensions? That is, should matrix_multiply([1,1,1], [[1],[1],[1]]) work, or
> should it complain about the first argument having fewer dimensions than the
> core dimensions in the signature?

I think that by default, gufuncs should definitely *not* allow this.

Example case 1: qr can be applied equally well to a (1, n) array or an (n, 1) array, but with different results. If the user passes in an (n,) array, then how do we know which one they wanted?

Example case 2: matrix multiplication, as you know :-), is a case where I do think we should allow for a bit more cleverness with the core dimensions... but the appropriate cleverness is much more subtle than just "prepend size 1 dimensions until things fit". Instead, for the first argument you need to prepend, for the second argument you need to append, and then you need to remove the corresponding dimensions from the output. Specific cases:

# Your version gives:
matmul([1, 1, 1], [[1], [1], [1]]).shape == (1, 1)
# But this should be (1,) (try it with np.dot)

# Your version gives:
matmul([[1, 1, 1]], [1, 1, 1]) -> error, (1, 3) and (1, 3) are not conformable
# But this should work (second argument should be treated as (3, 1), not (1, 3))

So the default should be to be strict about core dimensions, unless explicitly requested otherwise by the person defining the gufunc.

> Lastly, there is an interesting side effect of the way this broadcasting is
> handled: if a gufunc specifies a core dimension in an output argument only,
> and an `out` kwarg is not passed in, then the output array will have that
> core dimension set to be of size 1 [...] Does this sound right?

Does this have any use cases? My vote is that we simply disallow this until we have concrete uses and can decide how to do it properly. That way there won't be any backcompat concerns to deal with later.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
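Nathaniel's prepend/append rule can be pinned down in a few lines of Python. A rough model for 1-d and 2-d operands (matmul_fixup is a hypothetical helper; np.dot stands in for the '(m,n),(n,p)->(m,p)' core loop):

    import numpy as np

    def matmul_fixup(a, b):
        # Prepend a size-1 core dimension to a 1-d first argument, append
        # one to a 1-d second argument, then drop them from the output.
        a, b = np.asarray(a), np.asarray(b)
        a_vec, b_vec = a.ndim == 1, b.ndim == 1
        if a_vec:
            a = a[np.newaxis, :]          # (n,) -> (1, n)
        if b_vec:
            b = b[:, np.newaxis]          # (n,) -> (n, 1)
        out = np.dot(a, b)
        if b_vec:
            out = out[..., 0]             # remove the appended dimension
        if a_vec:                         # remove the prepended dimension
            out = out[..., 0] if b_vec else out[..., 0, :]
        return out

    # matmul_fixup([1, 1, 1], [[1], [1], [1]]).shape == (1,), matching np.dot
    # matmul_fixup([[1, 1, 1]], [1, 1, 1]).shape == (1,), matching np.dot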
From jaime.frio at gmail.com Tue Sep 16 20:31:01 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Tue, 16 Sep 2014 17:31:01 -0700
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID:

On Tue, Sep 16, 2014 at 4:32 PM, Nathaniel Smith wrote:

> [...]
>
> I think that by default, gufuncs should definitely *not* allow this.

Too late! ;-)

I just put together some working code and sent a PR implementing the behavior that Charles asked for:

https://github.com/numpy/numpy/pull/5077

Should we keep the discussion here, or take it over there?

Jaime

> [...]

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.

From njs at pobox.com Tue Sep 16 21:06:56 2014
From: njs at pobox.com (Nathaniel Smith)
Date: Tue, 16 Sep 2014 21:06:56 -0400
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID:

On Tue, Sep 16, 2014 at 8:31 PM, Jaime Fernández del Río wrote:

> [...]
>
> I just put together some working code and sent a PR implementing the
> behavior that Charles asked for:
>
> https://github.com/numpy/numpy/pull/5077
>
> Should we keep the discussion here, or take it over there?

I guess the default is, design discussions here where people can chime in, code finickiness over there to avoid boring people? So that would suggest keeping the discussion here until we've resolved the high-level debate about what the behaviour should even be. But it isn't a huge issue either way...

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

From jaime.frio at gmail.com Wed Sep 17 01:51:10 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Tue, 16 Sep 2014 22:51:10 -0700
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID:

On Tue, Sep 16, 2014 at 4:32 PM, Nathaniel Smith wrote:

> [...]
>
> So the default should be to be strict about core dimensions, unless
> explicitly requested otherwise by the person defining the gufunc.

#5077 now implements this behavior, which I agree is a sensible thing to do. And it doesn't seem very likely that anyone (numpy tests aside!) is expecting the old behavior to hold.

> > Lastly, there is an interesting side effect of the way this broadcasting is
> > handled [...]
>
> Does this have any use cases? My vote is that we simply disallow this
> until we have concrete uses and can decide how to do it properly. That
> way there won't be any backcompat concerns to deal with later.

The obvious example is "compute all pairwise distances." You can define a gufunc with signature (n,d)->(m) that does that, and wrap it in a Python function that makes sure that it always gets called with an array with m = n * (n - 1) for the last dimension of the out parameter. If you do not specify the out array, it will create one with m = 1, or with the implementation in #5077, m = -1, which I suppose will fail badly with a cryptic, apparently unrelated error.

I don't think either of these has any practical use, and think there are only three sensible options here:

1. Raise an error if a core dimension wasn't specified by the passed arrays (in the face of ambiguity...)
2. Do not allow dimensions in the output arrays that are not also in one of the input arrays.
3. Provide a method for the gufunc to decide on its own what m should be based on the other core dimensions.

I like the idea of 3, but we need a lot of discussion on whether it really is a good idea, and what the best mechanism to implement it is. I think 2 would be a mistake, unless something like 3 was in place, for being too restrictive. And I see 1 as the quickest short term solution to fix something that is broken, or would be if gufuncs were being more extensively used out there.

Jaime

From sebastian at sipsolutions.net Wed Sep 17 04:30:30 2014
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Wed, 17 Sep 2014 10:30:30 +0200
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: Message-ID: <1410942630.19651.1.camel@sebastian-t440>

On Di, 2014-09-16 at 16:51 -0400, Nathaniel Smith wrote:

> [...]
>
> The case Chuck is talking about is like if we allowed matrix
> multiplication between an array with shape (n, 1) with an array with
> (k, m), because (n, 1) can be broadcast to (n, k). This feels VERY
> wrong to me, will certainly hide many bugs, and is definitely not how
> it works right now (for np.dot, anyway; apparently it does work that
> way for the brand-new gufunc np.linalg.matrix_multiply, but this must
> be an accident).

Agreed, the only argument to not change it right away would be being afraid of breaking user code abusing the kind of thing Josef mentioned.

- Sebastian

> [...]
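To make the pairwise-distance case concrete, the wrapper Jaime describes might look like this (pairwise_gufunc is hypothetical, so a slow pure-Python fallback is included to make the sketch runnable; m = n*(n-1) counts ordered pairs):

    import numpy as np

    def pairwise_distances(x, pairwise_gufunc=None):
        # The Python layer computes m = n*(n-1) and allocates the out array,
        # because a '(n,d)->(m)' gufunc has no way to derive m from n itself.
        x = np.asarray(x, dtype=float)
        n = x.shape[-2]
        out = np.empty(x.shape[:-2] + (n * (n - 1),))
        if pairwise_gufunc is not None:        # hypothetical gufunc path
            return pairwise_gufunc(x, out=out)
        k = 0                                  # fallback: all ordered pairs
        for i in range(n):
            for j in range(n):
                if i != j:
                    out[..., k] = np.sqrt(((x[..., i, :] - x[..., j, :]) ** 2).sum(-1))
                    k += 1
        return out

    # pairwise_distances(np.random.rand(10, 5, 3)).shape == (10, 20)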
From charlesr.harris at gmail.com Wed Sep 17 08:33:26 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 17 Sep 2014 06:33:26 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: <1410942630.19651.1.camel@sebastian-t440> References: <1410942630.19651.1.camel@sebastian-t440> Message-ID:

On Wed, Sep 17, 2014 at 2:30 AM, Sebastian Berg wrote:

> [...]
>
> Agreed, the only argument to not change it right away would be being
> afraid of breaking user code abusing the kind of thing Josef mentioned.

It *is* a big change. I think of the gufuncs as working with matrices and vectors, the array version of the matrix class. In that case the signature shapes must be preserved and there should be no broadcasting within the signature. The matrix class itself fails in that regard:

In [1]: matrix(eye(3)) + matrix([[1,1,1]])
Out[1]:
matrix([[ 2., 1., 1.],
        [ 1., 2., 1.],
        [ 1., 1., 2.]])

Which is all wrong for a matrix type.

It would also be nice if the order could be made part of the signature, as DGEMM and friends like one of the argument axes to be contiguous, but I don't see a clean way to do that. The gufuncs do have an order parameter which should probably default to 'C' if the arrays/vectors are stacked. I think the default is currently 'K'. Hmm, we could make 'K' refer to the last one or two dimensions in the inputs. OTOH, that isn't needed for types not handled by BLAS. Or it could be handled in the inner loops.

Chuck

From sebastian at sipsolutions.net Wed Sep 17 08:48:14 2014
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Wed, 17 Sep 2014 14:48:14 +0200
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: <1410942630.19651.1.camel@sebastian-t440> Message-ID: <1410958094.21667.4.camel@sebastian-t440>

On Mi, 2014-09-17 at 06:33 -0600, Charles R Harris wrote:

> [...]
>
> It would also be nice if the order could be made part of the signature,
> as DGEMM and friends like one of the argument axes to be contiguous,
> but I don't see a clean way to do that. [...]

This is a different discussion, right? It would be nice to have an order flag for the core dimensions. The gufunc itself should not care at all about the outer ones. All the orders for the core dimensions would probably be nice, including no contiguity being enforced (or actually, maybe we can define 'K' to mean that in this context). To be honest, if 'K' means that, it seems like a decent default.

- Sebastian

From charlesr.harris at gmail.com Wed Sep 17 08:57:49 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 17 Sep 2014 06:57:49 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: <1410958094.21667.4.camel@sebastian-t440> References: <1410942630.19651.1.camel@sebastian-t440> <1410958094.21667.4.camel@sebastian-t440> Message-ID:

On Wed, Sep 17, 2014 at 6:48 AM, Sebastian Berg wrote:

> This is a different discussion, right? It would be nice to have an order
> flag for the core dimensions. The gufunc itself should not care at all
> about the outer ones.

Right. It is possible to check all these things in the loop, but the loop code grows...

> All the orders for the core dimensions would probably be nice, including
> no contiguity being enforced (or actually, maybe we can define 'K' to
> mean that in this context). To be honest, if 'K' means that, it seems
> like a decent default.

With regards to the main topic, we could extend the signature notation, using `[...]` instead of `(...)` for the new behavior.

Chuck

From charlesr.harris at gmail.com Wed Sep 17 16:27:30 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 17 Sep 2014 14:27:30 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: <1410942630.19651.1.camel@sebastian-t440> <1410958094.21667.4.camel@sebastian-t440> Message-ID:

On Wed, Sep 17, 2014 at 6:57 AM, Charles R Harris wrote:

> [...]
>
> With regards to the main topic, we could extend the signature notation,
> using `[...]` instead of `(...)` for the new behavior.
Or we could add a new function, PyUFunc_StrictGeneralizedFunction, with the new behavior. That might be the safe way to go.

Chuck

From jaime.frio at gmail.com Wed Sep 17 17:01:02 2014
From: jaime.frio at gmail.com (Jaime Fernández del Río)
Date: Wed, 17 Sep 2014 14:01:02 -0700
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: <1410942630.19651.1.camel@sebastian-t440> <1410958094.21667.4.camel@sebastian-t440> Message-ID:

On Wed, Sep 17, 2014 at 1:27 PM, Charles R Harris wrote:

> [...]
>
> Or we could add a new function, PyUFunc_StrictGeneralizedFunction, with
> the new behavior. That might be the safe way to go.

That sounds good to me, the current flow is that 'ufunc_generic_call', which is the function in the tp_call slot of the PyUFunc object, calls 'PyUFunc_GenericFunction', which will call 'PyUFunc_GeneralizedFunction' if the 'core_enabled' member variable is set to 1. We could have a new 'PyUFunc_StrictFromFuncAndDataAndSignature' that sets the 'core_enabled' variable to e.g. 2, and then dispatch on this value in 'PyUFunc_GenericFunction' to the new 'PyUFunc_StrictGeneralizedFunction'.

This will also give us a better sandbox to experiment with all the other enhancements we have been talking about: frozen dimensions, optional dimensions, computed dimensions...

I am guessing we still want to deprecate the old behavior in the next release and remove it entirely in a couple more, right?

Jaime

--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.

From charlesr.harris at gmail.com Wed Sep 17 17:29:21 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 17 Sep 2014 15:29:21 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: <1410942630.19651.1.camel@sebastian-t440> <1410958094.21667.4.camel@sebastian-t440> Message-ID:

On Wed, Sep 17, 2014 at 3:01 PM, Jaime Fernández del Río wrote:

> [...]
>
> This will also give us a better sandbox to experiment with all the other
> enhancements we have been talking about: frozen dimensions, optional
> dimensions, computed dimensions...

That sounds good, it is cleaner than the other solutions. The new constructor will need to be in the interface and the interface version updated.

> I am guessing we still want to deprecate the old behavior in the next
> release and remove it entirely in a couple more, right?

Don't know. It is in the interface, so might want to just deprecate it and leave it laying around. Could maybe add an argument to the new constructor that sets the `core_enabled` value so we don't need to keep adding new functions to the api. If so, should probably be an enum in the include file so valid values get passed.

Chuck

From charlesr.harris at gmail.com Wed Sep 17 17:37:34 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Wed, 17 Sep 2014 15:37:34 -0600
Subject: [Numpy-discussion] Is this a bug?
In-Reply-To: References: <1410942630.19651.1.camel@sebastian-t440> <1410958094.21667.4.camel@sebastian-t440> Message-ID:

On Wed, Sep 17, 2014 at 3:29 PM, Charles R Harris wrote:

> [...]
>
> Could maybe add an argument to the new constructor that sets the
> `core_enabled` value so we don't need to keep adding new functions to the
> api. If so, should probably be an enum in the include file so valid values
> get passed.

And then Ufunc in the code_generator could be modified to take both utype (core_enabled value) and signature and use the new constructor.

Chuck

From encukou at gmail.com Thu Sep 18 12:31:23 2014
From: encukou at gmail.com (Petr Viktorin)
Date: Thu, 18 Sep 2014 18:31:23 +0200
Subject: [Numpy-discussion] (no subject)
Message-ID:

Hello,
Over at Python-ideas, there is a thread [0] about the following discrepancy:

>>> numpy.array(float('inf')) // 1
inf
>>> float('inf') // 1
nan

There are reasons for either result, but I believe it would be very nice if either Python or Numpy changed, so they would give the same value. If any of you have reasons to defend Numpy's (or Python's) choice, or otherwise want to chime in, please post there.

Thanks,
Petr Viktorin

[0] https://mail.python.org/pipermail/python-ideas/2014-September/029365.html

From encukou at gmail.com Thu Sep 18 12:34:23 2014
From: encukou at gmail.com (Petr Viktorin)
Date: Thu, 18 Sep 2014 18:34:23 +0200
Subject: [Numpy-discussion] float('inf') // 1 = ?
Message-ID:

(Apologies for the lack of subject earlier)

Hello,
Over at Python-ideas, there is a thread [0] about the following discrepancy:

>>> numpy.array(float('inf')) // 1
inf
>>> float('inf') // 1
nan

There are reasons for either result, but I believe it would be very nice if either Python or Numpy changed, so they would give the same value.
If any of you have reasons to defend Numpy's (or Python's) choice, or otherwise want to chime in, please post there. Thanks, Petr Viktorin [0] https://mail.python.org/pipermail/python-ideas/2014-September/029365.html From ben.root at ou.edu Thu Sep 18 12:48:30 2014 From: ben.root at ou.edu (Benjamin Root) Date: Thu, 18 Sep 2014 12:48:30 -0400 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: My vote is that NumPy is correct here. I see no reason why

>>> float('inf') / 1
and
>>> float('inf') // 1

should return different results. Ben Root On Thu, Sep 18, 2014 at 12:31 PM, Petr Viktorin wrote: > Hello, > Over at Python-ideas, there is a thread [0] about the following > discrepancy: > > >>> numpy.array(float('inf')) // 1 > inf > >>> float('inf') // 1 > nan > > There are reasons for either result, but I believe it would be very > nice if either Python or Numpy changed, so they would give the same > value. > If any of you have reasons to defend Numpy's (or Python's) choice, or > otherwise want to chime in, please post there. > > Thanks, > Petr Viktorin > > > [0] > https://mail.python.org/pipermail/python-ideas/2014-September/029365.html > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From encukou at gmail.com Thu Sep 18 12:55:08 2014 From: encukou at gmail.com (Petr Viktorin) Date: Thu, 18 Sep 2014 18:55:08 +0200 Subject: [Numpy-discussion] float('inf') // 1 = ? Message-ID: Sorry for the lack of subject before. On Thu, Sep 18, 2014 at 6:48 PM, Benjamin Root wrote: > My vote is that NumPy is correct here. I see no reason why >>>> float('inf') / 1 > and >>>> float('inf') // 1 > > should return different results. I recommend reading the python-ideas thread; there are some arguments for both sides. > On Thu, Sep 18, 2014 at 12:31 PM, Petr Viktorin wrote: >> >> Hello, >> Over at Python-ideas, there is a thread [0] about the following >> discrepancy: >> >> >>> numpy.array(float('inf')) // 1 >> inf >> >>> float('inf') // 1 >> nan >> >> There are reasons for either result, but I believe it would be very >> nice if either Python or Numpy changed, so they would give the same >> value. >> If any of you have reasons to defend Numpy's (or Python's) choice, or >> otherwise want to chime in, please post there. >> >> Thanks, >> Petr Viktorin >> >> >> [0] >> https://mail.python.org/pipermail/python-ideas/2014-September/029365.html >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From chris.barker at noaa.gov Thu Sep 18 13:01:38 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 18 Sep 2014 10:01:38 -0700 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: Well, First of all, numpy and the python math module have a number of differences when it comes to handling these kinds of special cases -- and I think that: 1) numpy needs to do what makes the most sense for numpy and NOT mirror the math lib. 2) the use-cases of the math lib and numpy are different, so they maybe _should_ have different handling of this kind of thing.
3) I'm not sure that the core devs think these kinds of issues are "wrong" enough to break backward compatibility in subtle ways. But it's a fun topic in any case, and maybe numpy's behavior could be improved. > My vote is that NumPy is correct here. I see no reason why > >>> float('inf') / 1 > and > >>> float('inf') // 1 > > should return different results. > Well, one argument is that "floor division" is supposed to return an integer value, and that inf is NOT an integer value. The integral part of infinity doesn't exist and thus is Not a Number. You also get some weird edge cases around the mod operator. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Sep 18 13:13:58 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 18 Sep 2014 19:13:58 +0200 Subject: [Numpy-discussion] float('inf') // 1 = ? In-Reply-To: References: Message-ID: <1411060438.3474.2.camel@sebastian-laptop> On Thu, 2014-09-18 at 18:55 +0200, Petr Viktorin wrote: > Sorry for the lack of subject before. > > On Thu, Sep 18, 2014 at 6:48 PM, Benjamin Root wrote: > > My vote is that NumPy is correct here. I see no reason why > >>>> float('inf') / 1 > > and > >>>> float('inf') // 1 > > Didn't read python ideas I have to admit (at least not enough to see many arguments). My biggest argument is that numpy does exactly this:

>>> import math
>>> math.floor(float('inf')/1)

Maybe I am naive in thinking that floordivide is basically a divide+floor operation, but arguing for NaN seems half arsed to me on first sight. Either it is `inf` and the equivalence Benjamin Root noted holds or you make it an error. NaN is not reasonable for an error return *especially* not for the standard lib (it *might* be for numpy). - Sebastian > > should return different results. > > I recommend reading the python-ideas thread; there are some arguments > for both sides. > > > On Thu, Sep 18, 2014 at 12:31 PM, Petr Viktorin wrote: > >> > >> Hello, > >> Over at Python-ideas, there is a thread [0] about the following > >> discrepancy: > >> > >> >>> numpy.array(float('inf')) // 1 > >> inf > >> >>> float('inf') // 1 > >> nan > >> > >> There are reasons for either result, but I believe it would be very > >> nice if either Python or Numpy changed, so they would give the same > >> value. > >> If any of you have reasons to defend Numpy's (or Python's) choice, or > >> otherwise want to chime in, please post there.
> >> > >> Thanks, > >> Petr Viktorin > >> > >> > >> [0] > >> https://mail.python.org/pipermail/python-ideas/2014-September/029365.html > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From jjhelmus at gmail.com Thu Sep 18 13:14:51 2014 From: jjhelmus at gmail.com (Jonathan Helmus) Date: Thu, 18 Sep 2014 12:14:51 -0500 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: <541B130B.8030307@gmail.com> On 09/18/2014 12:01 PM, Chris Barker wrote: > Well, > > First of all, numpy and the python math module have a number of > differences when it comes to handling these kinds of special cases -- > and I think that: > > 1) numpy needs to do what makes the most sense for numpy and NOT > mirror the math lib. > > 2) the use-cases of the math lib and numpy are different, so they > maybe _should_ have different handling of this kind of thing. > > 3) I'm not sure that the core devs think these kinds of issues are > "wrong" enough to break backward compatibility in subtle ways. > > But it's a fun topic in any case, and maybe numpy's behavior could be > improved. > > My vote is that NumPy is correct here. I see no reason why > >>> float('inf') / 1 > and > >>> float('inf') // 1 > > should return different results. > > > Well, one argument is that "floor division" is supposed to return an > integer value, and that inf is NOT an integer value. The integral part > of infinity doesn't exist and thus is Not a Number. > But nan is not an integer value either:

>>> float('inf') // 1
nan
>>> int(float('inf') // 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot convert float NaN to integer

Perhaps float('inf') // 1 should raise a ValueError directly since there is no proper way to perform the floor division on infinity. - Jonathan Helmus > You also get some weird edge cases around the mod operator. > > -Chris > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From encukou at gmail.com Thu Sep 18 13:44:09 2014 From: encukou at gmail.com (Petr Viktorin) Date: Thu, 18 Sep 2014 19:44:09 +0200 Subject: [Numpy-discussion] (no subject) In-Reply-To: <541B130B.8030307@gmail.com> References: <541B130B.8030307@gmail.com> Message-ID: On Thu, Sep 18, 2014 at 7:14 PM, Jonathan Helmus wrote: > On 09/18/2014 12:01 PM, Chris Barker wrote: > > Well, > > First of all, numpy and the python math module have a number of differences > when it comes to handling these kinds of special cases -- and I think that: > > 1) numpy needs to do what makes the most sense for numpy and NOT mirror the > math lib. Sure.
> 2) the use-cases of the math lib and numpy are different, so they maybe > _should_ have different handling of this kind of thing. If you have a reason for the difference, I'd like to hear it. > 3) I'm not sure that the core devs think these kinds of issues are "wrong" > enough to break backward compatibility in subtle ways. I'd be perfectly fine with it being documented and tested (in CPython) as either a design mistake or design choice. > But it's a fun topic in any case, and maybe numpy's behavior could be > improved. >> >> My vote is that NumPy is correct here. I see no reason why >> >>> float('inf') / 1 >> and >> >>> float('inf') // 1 >> >> should return different results. > > > Well, one argument is that "floor division" is supposed to return an integer > value, and that inf is NOT an integer value. The integral part of infinity > doesn't exist and thus is Not a Number. > > > But nan is not an integer value either: > >>>> float('inf') // 1 > nan >>>> int(float('inf') // 1) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > ValueError: cannot convert float NaN to integer > > Perhaps float('inf') // 1 should raise a ValueError directly since there is > no proper way to perform the floor division on infinity. inf is not even a *real* number; a lot of operations don't make mathematical sense on it. But most are defined anyway, and quite sanely. From chris.barker at noaa.gov Thu Sep 18 14:40:28 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 18 Sep 2014 11:40:28 -0700 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: <541B130B.8030307@gmail.com> Message-ID: On Thu, Sep 18, 2014 at 10:44 AM, Petr Viktorin wrote: > > 2) the use-cases of the math lib and numpy are different, so they maybe > > _should_ have different handling of this kind of thing. > > If you have a reason for the difference, I'd like to hear it. For one, numpy does array operations, and you really don't want a ValueError (or any Exception) raised when perhaps only one value in a huge array has an issue. The other is that numpy users are potentially more sophisticated with regard to numeric computing issues, and in any case, need to prioritize different things -- like performance over safety. > > But nan is not an integer value either: > I meant conceptually. Sure -- it's not any number at all -- a NaN can be arrived at many ways, all it means is something happened for which there was not an appropriate numerical answer -- even inf or -inf. So, the question is: is the integer part of inf infinity? Or is it undefined, and therefore NaN? I can't imagine a use case where it would matter, which is probably why numpy returns inf. > Perhaps float('inf') // 1 should raise a ValueError directly since there > is > > no proper way to perform the floor division on infinity. > not in numpy for sure -- but I don't see the point in the math lib either, let the NaN propagate and deal with it later if you need to -- that's what they are for. - Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Sep 18 14:41:49 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 18 Sep 2014 20:41:49 +0200 Subject: [Numpy-discussion] float('inf') // 1 = ?
In-Reply-To: <1411060438.3474.2.camel@sebastian-laptop> References: <1411060438.3474.2.camel@sebastian-laptop> Message-ID: <1411065709.19963.12.camel@sebastian-t440> On Do, 2014-09-18 at 19:13 +0200, Sebastian Berg wrote: > On Thu, 2014-09-18 at 18:55 +0200, Petr Viktorin wrote: > > Sorry for the lack of subject before. > > > > On Thu, Sep 18, 2014 at 6:48 PM, Benjamin Root wrote: > > > My vote is that NumPy is correct here. I see no reason why > > >>>> float('inf') / 1 > > > and > > >>>> float('inf') // 1 > > > > > Didn't read python ideas I have to admit (at least not enough to see > many arguments). My biggest argument is that numpy does exactly this: > > >>> import math > >>> math.floor(float('inf')/1) > > Maybe I am naive in thinking that floordivide is basically a divide+floor > operation, but arguing for NaN seems half arsed to me on first sight. > Either it is `inf` and the equivalence Benjamin Root noted holds or you > make it an error. NaN is not reasonable for an error return *especially* > not for the standard lib (it *might* be for numpy). > Ok, sorry, that was too quick. Since we already have a "special" value like Inf, it is OK to not give an error in the standard lib, since Inf-Inf is also NaN, etc. However, I read the arguments as "Inf is not an integral number", and I frankly don't understand that at all. Infinity isn't a real number either!? If you represent the result as an IEEE float (integral or not) you can use Inf as a valid result IMO. Arguably all of the limits:

floor(float) -> Inf
float // 1 -> Inf

exist for float -> Inf and can be represented. Also IEEE seems to define the floor operation like this and I don't see a reason to violate `a//b == floor(a/b)`. As long as the result is represented as an IEEE floating point, which knows Infinity, I argue that it is correct. If it is not an IEEE floating point, it is an error in any case. - Sebastian > - Sebastian > > > > should return different results. > > > > I recommend reading the python-ideas thread; there are some arguments > > for both sides. > > > > > On Thu, Sep 18, 2014 at 12:31 PM, Petr Viktorin wrote: > > >> > > >> Hello, > > >> Over at Python-ideas, there is a thread [0] about the following > > >> discrepancy: > > >> > > >> >>> numpy.array(float('inf')) // 1 > > >> inf > > >> >>> float('inf') // 1 > > >> nan > > >> > > >> There are reasons for either result, but I believe it would be very > > >> nice if either Python or Numpy changed, so they would give the same > > >> value. > > >> If any of you have reasons to defend Numpy's (or Python's) choice, or > > >> otherwise want to chime in, please post there.
> > >> > > >> Thanks, > > >> Petr Viktorin > > >> > > >> > > >> [0] > > >> https://mail.python.org/pipermail/python-ideas/2014-September/029365.html > > >> _______________________________________________ > > >> NumPy-Discussion mailing list > > >> NumPy-Discussion at scipy.org > > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion at scipy.org > > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: This is a digitally signed message part URL: From jjhelmus at gmail.com Thu Sep 18 15:30:33 2014 From: jjhelmus at gmail.com (Jonathan Helmus) Date: Thu, 18 Sep 2014 14:30:33 -0500 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: <541B130B.8030307@gmail.com> Message-ID: <541B32D9.6070501@gmail.com> On 09/18/2014 12:44 PM, Petr Viktorin wrote: > On Thu, Sep 18, 2014 at 7:14 PM, Jonathan Helmus wrote: >> On 09/18/2014 12:01 PM, Chris Barker wrote: >> >> Well, >> >> First of all, numpy and the python math module have a number of differences >> when it comes to handling these kinds of special cases -- and I think that: >> >> 1) numpy needs to do what makes the most sense for numpy and NOT mirror the >> math lib. > Sure. > >> 2) the use-cases of the math lib and numpy are different, so they maybe >> _should_ have different handling of this kind of thing. > If you have a reason for the difference, I'd like to hear it. > >> 3) I'm not sure that the core devs think these kinds of issues are "wrong" >> enough to break backward compatibility in subtle ways. > I'd be perfectly fine with it being documented and tested (in CPython) > as either a design mistake or design choice. > >> But it's a fun topic in any case, and maybe numpy's behavior could be >> improved. >>> My vote is that NumPy is correct here. I see no reason why >>>>>> float('inf') / 1 >>> and >>>>>> float('inf') // 1 >>> should return different results. >> >> Well, one argument is that "floor division" is supposed to return an integer >> value, and that inf is NOT an integer value. The integral part of infinity >> doesn't exist and thus is Not a Number. >> >> >> But nan is not an integer value either: >> >>>>> float('inf') // 1 >> nan >>>>> int(float('inf') // 1) >> Traceback (most recent call last): >> File "<stdin>", line 1, in <module> >> ValueError: cannot convert float NaN to integer >> >> Perhaps float('inf') // 1 should raise a ValueError directly since there is >> no proper way to perform the floor division on infinity. > inf is not even a *real* number; a lot of operations don't make > mathematical sense on it. But most are defined anyway, and quite > sanely. But in IEEE-754 inf is a valid floating point number (whereas NaN is not) and has well-defined arithmetic, specifically inf / 1 == inf and RoundToIntegral(inf) == inf. In the numpy example, the numpy.array(float('inf')) statement creates an array containing a float32 or float64 representation of inf.
After this I would expect a floor division to return inf since that is what IEEE-754 arithmetic specifies. For me the question is if the floor division should also perform a cast to an integer type. Since inf cannot be represented in most common integer formats this should raise an exception. Since // does not normally perform a cast, for example type(float(5) // 2) == float, the point is moot. The real question is if Python floats follow IEEE-754 arithmetic or not. If they do not, what standard are they going to follow? - Jonathan Helmus From insertinterestingnamehere at gmail.com Thu Sep 18 15:46:08 2014 From: insertinterestingnamehere at gmail.com (Ian Henriksen) Date: Thu, 18 Sep 2014 13:46:08 -0600 Subject: [Numpy-discussion] (no subject) In-Reply-To: <541B32D9.6070501@gmail.com> References: <541B130B.8030307@gmail.com> <541B32D9.6070501@gmail.com> Message-ID: On Thu, Sep 18, 2014 at 1:30 PM, Jonathan Helmus wrote: > On 09/18/2014 12:44 PM, Petr Viktorin wrote: > > On Thu, Sep 18, 2014 at 7:14 PM, Jonathan Helmus > wrote: > >> On 09/18/2014 12:01 PM, Chris Barker wrote: > >> > >> Well, > >> > >> First of all, numpy and the python math module have a number of > differences > >> when it comes to handling these kinds of special cases -- and I think > that: > >> > >> 1) numpy needs to do what makes the most sense for numpy and NOT mirror > the > >> math lib. > > Sure. > > > >> 2) the use-cases of the math lib and numpy are different, so they maybe > >> _should_ have different handling of this kind of thing. > > If you have a reason for the difference, I'd like to hear it. > > > >> 3) I'm not sure that the core devs think these kinds of issues are > "wrong" > >> enough to break backward compatibility in subtle ways. > > I'd be perfectly fine with it being documented and tested (in CPython) > > as either a design mistake or design choice. > > > >> But it's a fun topic in any case, and maybe numpy's behavior could be > >> improved. > >>> My vote is that NumPy is correct here. I see no reason why > >>>>>> float('inf') / 1 > >>> and > >>>>>> float('inf') // 1 > >>> should return different results. > >> > >> Well, one argument is that "floor division" is supposed to return an > integer > >> value, and that inf is NOT an integer value. The integral part of > infinity > >> doesn't exist and thus is Not a Number. > >> > >> > >> But nan is not an integer value either: > >> > >>>>> float('inf') // 1 > >> nan > >>>>> int(float('inf') // 1) > >> Traceback (most recent call last): > >> File "<stdin>", line 1, in <module> > >> ValueError: cannot convert float NaN to integer > >> > >> Perhaps float('inf') // 1 should raise a ValueError directly since > there is > >> no proper way to perform the floor division on infinity. > > inf is not even a *real* number; a lot of operations don't make > > mathematical sense on it. But most are defined anyway, and quite > > sanely. > > But in IEEE-754 inf is a valid floating point number (whereas NaN is > not) and has well-defined arithmetic, specifically inf / 1 == inf and > RoundToIntegral(inf) == inf. In the numpy example, the > numpy.array(float('inf')) statement creates an array containing a > float32 or float64 representation of inf. After this I would expect a > floor division to return inf since that is what IEEE-754 arithmetic > specifies. > > For me the question is if the floor division should also perform a cast > to an integer type. Since inf cannot be represented in most common > integer formats this should raise an exception.
Since // does not > normally perform a cast, for example type(float(5) // 2) == float, the > point is moot. > > The real question is if Python floats follow IEEE-754 arithmetic or > not. If they do not, what standard are they going to follow? > > - Jonathan Helmus > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > Agreed. It's definitely best to follow the IEEE conventions. That will be the most commonly expected behavior, especially in ambiguous cases like this. -Ian Henriksen -------------- next part -------------- An HTML attachment was scrubbed... URL: From pmhobson at gmail.com Fri Sep 19 10:47:46 2014 From: pmhobson at gmail.com (Paul Hobson) Date: Fri, 19 Sep 2014 07:47:46 -0700 Subject: [Numpy-discussion] Generalize hstack/vstack --> stack; Block matrices like in matlab In-Reply-To: References: <101656916431878296.890307sturla.molden-gmail.com@news.gmane.org> Message-ID: Hey Ben, Side note: I've had to do the same thing for stitching curvilinear model grid coordinates together. Using pandas DataFrames indexed by `i` and `j` is really good for this. You can offset the indices directly, unstack the DF, and pandas will align for you. Happy to send an example along if you're curious. -p On Mon, Sep 8, 2014 at 9:55 AM, Benjamin Root wrote: > A use case would be "image stitching" or even data tiling. I have had to > implement something like this at work (so, I can't share it, unfortunately) > and it even goes so far as to allow the caller to specify how much the > tiles can overlap and such. The specification is ungodly hideous and I > doubt I would be willing to share it even if I could lest I release > code-thulu upon the world... > > I think just having this generalized stack feature would be a nice start. > Tetris could be built on top of that later. (Although, I do vote for at > least 3 or 4 dimensional stacking, if possible). > > Cheers! > Ben Root > > > On Mon, Sep 8, 2014 at 12:41 PM, Eelco Hoogendoorn < > hoogendoorn.eelco at gmail.com> wrote: > >> Sturla: im not sure if the intention is always unambiguous, for such more >> flexible arrangements. >> >> Also, I doubt such situations arise often in practice; if the arrays arnt >> a grid, they are probably a nested grid, and the code would most naturally >> concatenate them with nested calls to a stacking function. >> >> However, some form of nd-stack function would be neat in my opinion. >> >> On Mon, Sep 8, 2014 at 6:10 PM, Jaime Fernández del Río < >> jaime.frio at gmail.com> wrote: >> >>> On Mon, Sep 8, 2014 at 7:41 AM, Sturla Molden >>> wrote: >>> >>>> Stefan Otte wrote: >>>> >>>> > stack([[a, b], [c, d]]) >>>> > >>>> > In my case `stack` replaced `hstack` and `vstack` almost completely. >>>> > >>>> > If you're interested in including it in numpy I created a pull request >>>> > [1]. I'm looking forward to getting some feedback! >>>> >>>> As far as I can see, it uses hstack and vstack. But that means a and b >>>> have >>>> to have the same number of rows, c and d must have the same number of >>>> rows, >>>> and hstack((a,b)) and hstack((c,d)) must have the same number of >>>> columns. >>>> >>>> Thus it requires a regularity like this:
>>>>
>>>> AAAABB
>>>> AAAABB
>>>> CCCDDD
>>>> CCCDDD
>>>> CCCDDD
>>>> CCCDDD
>>>>
>>>> What if we just ignore this constraint, and only require the output to >>>> be >>>> rectangular?
Now we have a 'tetris game':
>>>>
>>>> AAAABB
>>>> AAAABB
>>>> CCCCBB
>>>> CCCCBB
>>>> CCCCDD
>>>> CCCCDD
>>>>
>>>> or
>>>>
>>>> AAAABB
>>>> AAAABB
>>>> CCCCBB
>>>> CCCCBB
>>>> CCCCBB
>>>> CCCCBB
>>>>
>>>> This should be 'stackable', yes? Or perhaps we need another stacking >>>> function for this, say numpy.tetris? >>>> >>>> And while we're at it, what about higher dimensions? should there be an >>>> ndstack function too? >>>> >>> >>> This is starting to look like the second time in a row Stefan tries to >>> extend numpy with a simple convenience function, and he gets tricked into >>> implementing some sophisticated algorithm... >>> >>> For his next PR I expect nothing less than an NP-complete problem. ;-) >>> >>> >>>> Jaime >>>
>>> --
>>> (\__/)
>>> ( O.o)
>>> ( > <) This is Conejo. Copy Conejo into your signature and help him in his plans for world domination.
>>>
>>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From demitri.muna at gmail.com Sun Sep 21 17:10:03 2014 From: demitri.muna at gmail.com (Demitri Muna) Date: Sun, 21 Sep 2014 17:10:03 -0400 Subject: [Numpy-discussion] Numpy 'None' comparison FutureWarning Message-ID: Hi, I just encountered the following in my code: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future. I'm very concerned about this. This is a very common programming pattern (lazy loading):

class A(object):
    def __init__(self):
        self._some_array = None

    @property
    def some_array(self):
        if self._some_array == None:
            # perform some expensive setup of array
        return self._some_array

It seems to me that the new behavior will break this pattern. I think that redefining the "==" operator is a little too aggressive here. It strikes me as very nonstandard and not at all obvious to someone reading the code that the comparison is a very special case for numpy objects. Unless there's some aspect I'm missing here, I think an element-wise comparator should be more explicit. Cheers, Demitri _________________________________________ Demitri Muna Department of Astronomy The Ohio State University http://trillianverse.org http://scicoder.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From efiring at hawaii.edu Sun Sep 21 17:19:56 2014 From: efiring at hawaii.edu (Eric Firing) Date: Sun, 21 Sep 2014 11:19:56 -1000 Subject: [Numpy-discussion] Numpy 'None' comparison FutureWarning In-Reply-To: References: Message-ID: <541F40FC.6010105@hawaii.edu> On 2014/09/21, 11:10 AM, Demitri Muna wrote: > Hi, > > I just encountered the following in my code: > > FutureWarning: comparison to `None` will result in an elementwise object > comparison in the future. > > I'm very concerned about this.
This is a very common programming pattern > (lazy loading):
>
> class A(object):
>     def __init__(self):
>         self._some_array = None
>
>     @property
>     def some_array(self):
>         if self._some_array == None:
>             # perform some expensive setup of array
>         return self._some_array
>
> It seems to me that the new behavior will break this pattern. I think > that redefining the "==" operator is a little too aggressive here. It > strikes me as very nonstandard and not at all obvious to someone reading > the code that the comparison is a very special case for numpy objects. > Unless there's some aspect I'm missing here, I think an element-wise > comparator should be more explicit. I think what you are missing is that the standard Python idiom for this use case is "if self._some_array is None:". This will continue to work, regardless of whether the object being checked is an ndarray or any other Python object. Eric > > Cheers, > Demitri > > _________________________________________ > Demitri Muna > > Department of Astronomy > The Ohio State University > > http://trillianverse.org > http://scicoder.org > > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From shoyer at gmail.com Sun Sep 21 19:50:12 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 21 Sep 2014 16:50:12 -0700 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type Message-ID: pandas has some hacks to support custom types of data that numpy can't handle well enough or at all. Examples include datetime and Categorical [1], and others like GeoArray [2] that haven't made it into pandas yet. Most of these look like numpy arrays but with custom dtypes and type-specific methods/properties. But clearly nobody is particularly excited about writing the C necessary to implement custom dtypes [3]. Nor do we need the ndarray ABI. In many cases, writing C may not actually even be necessary for performance reasons, e.g., categorical can be fast enough just by wrapping an integer ndarray for the internal storage and using vectorized operations. And even if it is necessary, I think we'd all rather write Cython than C. It's great for pandas to write its own ndarray-like wrappers (*not* subclasses) that work with pandas, but it's a shame that there isn't a standard interface like the ndarray to make these arrays useable for the rest of the scientific Python ecosystem. For example, pandas has loads of fixes for np.datetime64, but nobody seems to be up for porting them to numpy (I doubt it would be easy). I know these sorts of concerns are not new, but I wish I had a sense of what the solution looks like. Is anyone actively working on these issues? Does the fix belong in numpy, pandas, blaze or a new project? I'd love to get a sense of where things stand and how I could help -- without writing any C :). Thanks, Stephan [1] https://github.com/pydata/pandas/pull/7217 [2] https://github.com/geopandas/geopandas/issues/166 [3] https://github.com/numpy/numpy-dtypes -------------- next part -------------- An HTML attachment was scrubbed...
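To make the "wrapping an integer ndarray" idea concrete, here is a rough sketch of what such a categorical could look like. This is an invented minimal class, not pandas' actual Categorical (which also tracks ordering, missing values, and much more):

import numpy as np

class SimpleCategorical(object):
    """Sketch: distinct values stored once, one integer code per element."""

    def __init__(self, values):
        # a single np.unique call yields both the sorted category labels
        # and the integer code of every element
        self.categories, self.codes = np.unique(values, return_inverse=True)

    def __eq__(self, value):
        # vectorized equality: look up the category's code once,
        # then compare small integers elementwise
        idx = np.searchsorted(self.categories, value)
        if idx == len(self.categories) or self.categories[idx] != value:
            return np.zeros(self.codes.shape, dtype=bool)
        return self.codes == idx

    def value_counts(self):
        # counting codes is just an integer histogram
        return np.bincount(self.codes, minlength=len(self.categories))

All the per-element work happens inside numpy's existing C loops, which is the sense in which no new C would be required.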
URL: From charlesr.harris at gmail.com Sun Sep 21 20:13:39 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 21 Sep 2014 18:13:39 -0600 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Sun, Sep 21, 2014 at 5:50 PM, Stephan Hoyer wrote: > pandas has some hacks to support custom types of data that numpy > can't handle well enough or at all. Examples include datetime and > Categorical [1], and others like GeoArray [2] that haven't made it into > pandas yet. > > Most of these look like numpy arrays but with custom dtypes and > type-specific methods/properties. But clearly nobody is particularly excited > about writing the C necessary to implement custom dtypes [3]. Nor do > we need the ndarray ABI. > > In many cases, writing C may not actually even be necessary for > performance reasons, e.g., categorical can be fast enough just by wrapping > an integer ndarray for the internal storage and using vectorized > operations. And even if it is necessary, I think we'd all rather write > Cython than C. > > It's great for pandas to write its own ndarray-like wrappers (*not* > subclasses) that work with pandas, but it's a shame that there isn't a > standard interface like the ndarray to make these arrays useable for the > rest of the scientific Python ecosystem. For example, pandas has loads of > fixes for np.datetime64, but nobody seems to be up for porting them to > numpy (I doubt it would be easy). > > I know these sorts of concerns are not new, but I wish I had a sense of > what the solution looks like. Is anyone actively working on these issues? > Does the fix belong in numpy, pandas, blaze or a new project? I'd love to > get a sense of where things stand and how I could help -- without writing > any C :). > > I haven't thought much about this myself, but others (Nathaniel?) have, and it would be good to explore the topic and maybe put together some examples/templates to make this approach easier. Input from someone with some experience would be *much* appreciated. The datetime problem persists and I've been thinking it would be nice to replace the current implementation with something simpler that can be stolen from elsewhere. It would be nice to hear how someone else dealt with the problem. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Sun Sep 21 21:30:55 2014 From: ben.root at ou.edu (Benjamin Root) Date: Sun, 21 Sep 2014 21:30:55 -0400 Subject: [Numpy-discussion] Numpy 'None' comparison FutureWarning In-Reply-To: <541F40FC.6010105@hawaii.edu> References: <541F40FC.6010105@hawaii.edu> Message-ID: That being said, I do wonder about related situations where the lhs of the equal sign might be an array, or it might be a None and you are comparing against another numpy array. In those situations, you aren't trying to compare against None, you are just checking if two objects are equivalent. When is this change planned? Ben Root On Sun, Sep 21, 2014 at 5:19 PM, Eric Firing wrote: > On 2014/09/21, 11:10 AM, Demitri Muna wrote: > > Hi, > > > > I just encountered the following in my code: > > > > FutureWarning: comparison to `None` will result in an elementwise object > > comparison in the future. > > > > I'm very concerned about this.
This is a very common programming pattern > > (lazy loading): > > > > class A(object): > > def __init__(self): > > self._some_array = None > > > > @property > > def some_array(self): > > if self._some_array == None: > > # perform some expensive setup of array > > return self._some_array > > > > It seems to me that the new behavior will break this pattern. I think > > that redefining the "==" operator is a little too aggressive here. It > > strikes me as very nonstandard and not at all obvious to someone reading > > the code that the comparison is a very special case for numpy objects. > > Unless there's some aspect I'm missing here, I think an element-wise > > comparator should be more explicit. > > > I think what you are missing is that the standard Python idiom for this > use case is "if self._some_array is None:". This will continue to work, > regardless of whether the object being checked is an ndarray or any > other Python object. > > Eric > > > > > Cheers, > > Demitri > > > > _________________________________________ > > Demitri Muna > > > > Department of Astronomy > > Le Ohio State University > > > > http://trillianverse.org > > http://scicoder.org > > > > > > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From demitri.muna at gmail.com Sun Sep 21 22:02:26 2014 From: demitri.muna at gmail.com (Demitri Muna) Date: Sun, 21 Sep 2014 22:02:26 -0400 Subject: [Numpy-discussion] Numpy 'None' comparison FutureWarning In-Reply-To: <541F40FC.6010105@hawaii.edu> References: <541F40FC.6010105@hawaii.edu> Message-ID: On Sep 21, 2014, at 5:19 PM, Eric Firing wrote: > I think what you are missing is that the standard Python idiom for this > use case is "if self._some_array is None:". This will continue to work, > regardless of whether the object being checked is an ndarray or any > other Python object. That's an alternative, but I think it's a subtle distinction that will be lost on many users. I still think that this is something that can easily trip up many people; it's not clear from looking at the code that this is the behavior; it's "hidden". At the very least, I strongly suggest that the warning point this out, e.g. "FutureWarning: comparison to `None` will result in an elementwise object comparison in the future; use 'value is None' as an alternative." Assume: a = np.array([1, 2, 3, 4]) b = np.array([None, None, None, None]) What is the result of "a == None"? Is it "np.array([False, False, False, False])"? What about the second case? Is the result of "b == None" -> np.array([True, True, True, True])? If so, then if (b == None): ... will always evaluate to "True" if b is "None" or *any* Numpy array, and that's clearly unexpected behavior. On Sep 21, 2014, at 9:30 PM, Benjamin Root wrote: > That being said, I do wonder about related situations where the lhs of the equal sign might be an array, or it might be a None and you are comparing against another numpy array. In those situations, you aren't trying to compare against None, you are just checking if two objects are equivalent. Right. 
With this change, using "==" with numpy arrays now sometimes means "are these equivalent" and other times "element-wise comparison". The potential for inadvertent bugs is far greater than what convenience this redefinition of a very basic operator might offer. Any scenario where (a == b) != (b == a) is asking for trouble. Cheers, Demitri _________________________________________ Demitri Muna Department of Astronomy The Ohio State University http://trillianverse.org http://scicoder.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sun Sep 21 22:53:57 2014 From: njs at pobox.com (Nathaniel Smith) Date: Sun, 21 Sep 2014 22:53:57 -0400 Subject: [Numpy-discussion] Numpy 'None' comparison FutureWarning In-Reply-To: References: <541F40FC.6010105@hawaii.edu> Message-ID: On 22 Sep 2014 03:02, "Demitri Muna" wrote: > > > On Sep 21, 2014, at 5:19 PM, Eric Firing wrote: > >> I think what you are missing is that the standard Python idiom for this >> use case is "if self._some_array is None:". This will continue to work, >> regardless of whether the object being checked is an ndarray or any >> other Python object. > > > That's an alternative, but I think it's a subtle distinction that will be lost on many users. I still think that this is something that can easily trip up many people; it's not clear from looking at the code that this is the behavior; it's "hidden". At the very least, I strongly suggest that the warning point this out, e.g. > > "FutureWarning: comparison to `None` will result in an elementwise object comparison in the future; use 'value is None' as an alternative." Making messages clearer is always welcome, and we devs aren't always in the best position to do so because we're too close to the issues to see which parts are confusing to outsiders - perhaps you'd like to submit a pull request with this? > Assume: > > a = np.array([1, 2, 3, 4]) > b = np.array([None, None, None, None]) > > What is the result of "a == None"? Is it "np.array([False, False, False, False])"? After this change, yes. > What about the second case? Is the result of "b == None" -> np.array([True, True, True, True])? Yes again. (Notice that this is also a subtle and confusing point for many users - how many people realize that if they want to get the latter result they have to write np.equal(b, None)?) > If so, then > > if (b == None): > ... > > will always evaluate to "True" if b is "None" or *any* Numpy array, and that's clearly unexpected behavior. No, that's not how numpy arrays interact with if statements. This is independent of the handling of 'arr == None': 'if multi_element_array' is always an error, because an if statement by definition requires a single true/false decision (it can't execute both branches after all!), but a multi-element array by definition contains multiple values that might have contradictory truthiness. Currently, 'b == x' returns an array in every situation *except* when x happens to be 'None'. After this change, 'b == x' will *always* return an array, so 'if b == x' will always raise an error. > > On Sep 21, 2014, at 9:30 PM, Benjamin Root wrote: > >> That being said, I do wonder about related situations where the lhs of the equal sign might be an array, or it might be a None and you are comparing against another numpy array. In those situations, you aren't trying to compare against None, you are just checking if two objects are equivalent. Benjamin, can you give a more concrete example?
Right now the *only* time == on arrays checks for equivalence is when the object being compared against is None, in which case == pretends to be 'is' because of this mysterious special case. In every other case it does a broadcasted ==, which is very different. > Right. With this change, using "==" with numpy arrays now sometimes means "are these equivalent" and other times "element-wise comparison". Err, you have this backwards :-). Right now == means element-wise comparison except in this one special case, where it doesn't. After the change, it will mean element-wise comparison consistently in all cases. > The potential for inadvertent bugs is far greater than what convenience this redefinition of a very basic operator might offer. Any scenario where > > (a == b) != (b == a) > > is asking for trouble. That would be unfortunate, yes, but fortunately it doesn't apply here :-). 'a == b' and 'b == a' currently always return the same thing, and there are no plans to change this - we'll be changing what both of them mean at the same time. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sun Sep 21 23:31:30 2014 From: njs at pobox.com (Nathaniel Smith) Date: Sun, 21 Sep 2014 23:31:30 -0400 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer wrote: > pandas has some hacks to support custom types of data for which numpy can't > handle well enough or at all. Examples include datetime and Categorical [1], > and others like GeoArray [2] that haven't make it into pandas yet. > > Most of these look like numpy arrays but with custom dtypes and type > specific methods/properties. But clearly nobody is particularly excited > about writing the the C necessary to implement custom dtypes [3]. Nor is do > we need the ndarray ABI. > > In many cases, writing C may not actually even be necessary for performance > reasons, e.g., categorical can be fast enough just by wrapping an integer > ndarray for the internal storage and using vectorized operations. And even > if it is necessary, I think we'd all rather write Cython than C. > > It's great for pandas to write its own ndarray-like wrappers (*not* > subclasses) that work with pandas, but it's a shame that there isn't a > standard interface like the ndarray to make these arrays useable for the > rest of the scientific Python ecosystem. For example, pandas has loads of > fixes for np.datetime64, but nobody seems to be up for porting them to numpy > (I doubt it would be easy). Writing them in the first place probably wasn't easy either :-). I don't really know why pandas spends so much effort on reimplementing stuff and papering over numpy limitations instead of fixing things upstream so that everyone can benefit. I assume they have reasons, and I could make some general guesses at what some of them might be, but if you want to know what they are -- which is presumably the first step in changing the situation -- you'll have to ask them, not us :-). > I know these sort of concerns are not new, but I wish I had a sense of what > the solution looks like. Is anyone actively working on these issues? Does > the fix belong in numpy, pandas, blaze or a new project? I'd love to get a > sense of where things stand and how I could help -- without writing any C > :). 
I think there are three parts: For stuff that's literally just fixing bugs in stuff that numpy already has, we'd certainly be happy to accept those bug fixes. Probably there are things we can do to make this easier, I dunno. I'd love to see some of numpy's internals moving into Cython to make them easier to hack on, but this won't be simple because right now using Cython to implement a module is really an all-or-nothing affair; making it possible to mix Cython with numpy's existing C code will require upstream changes in Cython. For cases where people genuinely want to implement a new array-like type (e.g. DataFrame or scipy.sparse), numpy provides a fair amount of support for this already (e.g., the various hooks that allow things like np.asarray(mydf) or np.sin(mydf) to work), and we're working on adding more over time (e.g., __numpy_ufunc__). My feeling though is that in most of the cases you mention, implementing a new array-like type is huge overkill. ndarray's interface is vast and reimplementing even 90% of it is a huge effort. For most of the cases that people seem to run into in practice, the solution is to enhance numpy's dtype interface so that it's possible for mere mortals to implement new dtypes, e.g. by just subclassing np.dtype. This is totally doable and would enable a ton of awesomeness, but it requires someone with the time to sit down and work on it, and no-one has volunteered yet. Unfortunately it does require hacking on C code though. -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From jeffreback at gmail.com Mon Sep 22 10:34:55 2014 From: jeffreback at gmail.com (Jeff Reback) Date: Mon, 22 Sep 2014 10:34:55 -0400 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: Hopefully this is not TL;DR! There are 3 'dtype'-likes that exist in pandas that could in theory mostly be migrated back to numpy. These currently exist as the .values, in other words the object to which pandas defers data storage and computation for some/most operations.

1) SparseArray: This is the basis for SparseSeries. It is ndarray-like (it's actually an ndarray sub-class) and optimized for the 1-d case.
My guess is that @wesm created this because it a) didn't exist in numpy, and b) didn't want scipy as an explicit dependency (at the time), late 2011.

2) datetime support: This is not a target dtype per se, but really a reimplementation over the top of datetime64[ns], with the associated scalar Timestamp which is a proper sub-class of datetime.datetime. I believe @wesm created this because numpy datetime support was (and still is to some extent) just completely broken (though better in 1.7+). It doesn't support proper timezones, the display is always in the local timezone, and the scalar type (np.datetime64) is not extensible at all (e.g. there is no easy way to have custom printing or parsing). These are all well known by the numpy community and have seen some recent proposals to remedy.

3) pd.Categorical: This was another class wesm wrote several years ago. It actually *could* be a numpy sub-class, though it's a bit awkward as it's really a numpy-like sub-class that contains 2 ndarray-like arrays, and is more appropriately implemented as a container of multiple-ndarrays.

So when we added support for Categoricals recently, why didn't we, say, try to push a categorical dtype? I think there are several reasons, in no particular order:

- pd.Categorical is really a container of multiple ndarrays, and is ndarray-like. Further its API is somewhat constrained. It was simpler to make a python container class rather than try to sub-class ndarray and basically override / throw out many methods (as a lot of computation methods simply don't make sense between 2 categoricals). You can make a case that this *should not* be in numpy for this reason.

- The changes in pandas for the 3 cases outlined above were mostly on how to integrate these with the top-level containers (Series/DataFrame), rather than actually writing / re-writing a new dtype for a ndarray class. We always try to reuse, so we just try to extend the ndarray-like rather than create a new one from scratch.

- Getting for example a Categorical dtype into numpy probably would take a pretty long cycle time. I think you need a champion for new features to really push them. It hasn't happened with datetime and that's been a while (of course it's possible that pandas diverted some of this need)

- API design: I think this is a big issue actually. When I added Categorical container support, I didn't want to change the API of Categorical much (and it pretty much worked out that way, mainly adding to it). So, say we took the path of assuming that numpy would have a nice categorical data dtype. We would almost certainly have to wrap it in something to provide needed functionality that would necessarily be missing in an initial version. (of course eventually that may not be necessary).

- So the 'nobody wants to write in C' argument is true for datetimes, but not for SparseArray/Categorical. In fact much of that code is just calling out to numpy (though some cython code too).

- from a performance perspective, numpy needs a really good hashtable in order to support proper factorizing, which @wesm co-opted klib to do (see this thread here for a discussion on this).

So I know I am repeating myself, but it comes down to this. The API/interface of the delegated methods needs to be defined. For ndarrays it is long established and well-known. So easy to gear pandas to that. However with a *newer* type that is not the case, so pandas can easily decide, hey this is the most correct behavior, let's do it this way, nothing to break, no back compat needed.
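To make "factorizing" concrete: it means mapping arbitrary values to a table of distinct values plus dense integer codes. A sketch of the sort-based version that plain numpy supports today (the function name is invented for illustration; the klib hashtable approach is O(n) rather than O(n log n) and preserves order of first appearance):

import numpy as np

def factorize_via_sort(values):
    # np.unique sorts, so the returned codes index into the *sorted* uniques
    uniques, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes, uniques

For example, factorize_via_sort(['b', 'a', 'b', 'c']) gives codes [1, 0, 1, 2] and uniques ['a', 'b', 'c'].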
Jeff On Sun, Sep 21, 2014 at 11:31 PM, Nathaniel Smith wrote: > On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer wrote: > > pandas has some hacks to support custom types of data for which numpy > can't > > handle well enough or at all. Examples include datetime and Categorical > [1], > > and others like GeoArray [2] that haven't make it into pandas yet. > > > > Most of these look like numpy arrays but with custom dtypes and type > > specific methods/properties. But clearly nobody is particularly excited > > about writing the the C necessary to implement custom dtypes [3]. Nor is > do > > we need the ndarray ABI. > > > > In many cases, writing C may not actually even be necessary for > performance > > reasons, e.g., categorical can be fast enough just by wrapping an integer > > ndarray for the internal storage and using vectorized operations. And > even > > if it is necessary, I think we'd all rather write Cython than C. > > > > It's great for pandas to write its own ndarray-like wrappers (*not* > > subclasses) that work with pandas, but it's a shame that there isn't a > > standard interface like the ndarray to make these arrays useable for the > > rest of the scientific Python ecosystem. For example, pandas has loads of > > fixes for np.datetime64, but nobody seems to be up for porting them to > numpy > > (I doubt it would be easy). > > Writing them in the first place probably wasn't easy either :-). I > don't really know why pandas spends so much effort on reimplementing > stuff and papering over numpy limitations instead of fixing things > upstream so that everyone can benefit. I assume they have reasons, and > I could make some general guesses at what some of them might be, but > if you want to know what they are -- which is presumably the first > step in changing the situation -- you'll have to ask them, not us :-). > > > I know these sort of concerns are not new, but I wish I had a sense of > what > > the solution looks like. Is anyone actively working on these issues? Does > > the fix belong in numpy, pandas, blaze or a new project? I'd love to get > a > > sense of where things stand and how I could help -- without writing any C > > :). > > I think there are there are three parts: > > For stuff that's literally just fixing bugs in stuff that numpy > already has, then we'd certainly be happy to accept those bug fixes. > Probably there are things we can do to make this easier, I dunno. I'd > love to see some of numpy's internals moving into Cython to make them > easier to hack on, but this won't be simple because right now using > Cython to implement a module is really an all-or-nothing affair; > making it possible to mix Cython with numpy's existing C code will > require upstream changes in Cython. > > For cases where people genuinely want to implement a new array-like > types (e.g. DataFrame or scipy.sparse) then numpy provides a fair > amount of support for this already (e.g., the various hooks that allow > things like np.asarray(mydf) or np.sin(mydf) to work), and we're > working on adding more over time (e.g., __numpy_ufunc__). > > My feeling though is that in most of the cases you mention, > implementing a new array-like type is huge overkill. ndarray's > interface is vast and reimplementing even 90% of it is a huge effort. > For most of the cases that people seem to run into in practice, the > solution is to enhance numpy's dtype interface so that it's possible > for mere mortals to implement new dtypes, e.g. by just subclassing > np.dtype. 
This is totally doable and would enable a ton of > awesomeness, but it requires someone with the time to sit down and > work on it, and no-one has volunteered yet. Unfortunately it does > require hacking on C code though. > > -- > Nathaniel J. Smith > Postdoctoral researcher - Informatics - University of Edinburgh > http://vorpus.org > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Sep 23 02:42:13 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 22 Sep 2014 23:42:13 -0700 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Sun, Sep 21, 2014 at 8:31 PM, Nathaniel Smith wrote: > For cases where people genuinely want to implement a new array-like > types (e.g. DataFrame or scipy.sparse) then numpy provides a fair > amount of support for this already (e.g., the various hooks that allow > things like np.asarray(mydf) or np.sin(mydf) to work), and we're > working on adding more over time (e.g., __numpy_ufunc__). > Agreed, numpy does a great job of this. It has been a surprising pleasure to integrate with numpy for my custom array-like types in xray. __numpy_ufunc__ will let us add a few more neat tricks. > My feeling though is that in most of the cases you mention, > implementing a new array-like type is huge overkill. ndarray's > interface is vast and reimplementing even 90% of it is a huge effort. > For most of the cases that people seem to run into in practice, the > solution is to enhance numpy's dtype interface so that it's possible > for mere mortals to implement new dtypes, e.g. by just subclassing > np.dtype. This is totally doable and would enable a ton of > awesomeness, but it requires someone with the time to sit down and > work on it, and no-one has volunteered yet. Unfortunately it does > require hacking on C code though. > Something to allow mere mortals such as myself to implement new dtypes sounds wonderful! Would it be useful to prototype something like this in pure Python? That sounds like a task that I could be up for. Like I said, I expect a (mostly) pure Python solution, at least for categorical and datetime, would be more maintainable, and performant enough, for use in pandas (given that this is basically the current approach), as long as the bottlenecks are dealt with appropriately. Anyone else interested in hacking on this with me? For what it's worth, I am not convinced that it is that terrible to reimplement most of the ndarray interface. As long as your object looks pretty much like an ndarray with a custom dtype, it should be quite straightforward to wrap the underlying array's methods/properties (a sketch of such a wrapper follows below). So I'm not too scared of that option, although I agree that it is a complete waste to do it again and again. Nathaniel and Jeff -- thank you so much for the detailed replies. Cheers, Stephan -------------- next part -------------- An HTML attachment was scrubbed...
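A minimal sketch of such a wrapper, with illustrative names only (WrappedArray is not an actual pandas or numpy class):

import numpy as np

class WrappedArray(object):
    """Ndarray-like object that stores an ndarray and forwards to it."""
    def __init__(self, values):
        self._values = np.asarray(values)

    # forward the core ndarray properties
    @property
    def shape(self):
        return self._values.shape

    @property
    def dtype(self):
        return self._values.dtype

    def __getitem__(self, key):
        # re-wrap on indexing so slices stay in the wrapper type
        return type(self)(self._values[key])

    def __array__(self):
        # lets np.asarray(x), and hence most numpy functions, see the data
        return self._values

    def __repr__(self):
        return 'WrappedArray(%r)' % (self._values,)

x = WrappedArray([1.0, 2.0, 3.0])
np.sin(x)    # works via __array__, though it returns a plain ndarray
x[1:]        # WrappedArray(array([ 2.,  3.]))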
URL: From cournape at gmail.com Tue Sep 23 03:19:12 2014 From: cournape at gmail.com (David Cournapeau) Date: Tue, 23 Sep 2014 08:19:12 +0100 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Mon, Sep 22, 2014 at 4:31 AM, Nathaniel Smith wrote: > On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer wrote: > > pandas has some hacks to support custom types of data for which numpy > can't > > handle well enough or at all. Examples include datetime and Categorical > [1], > > and others like GeoArray [2] that haven't make it into pandas yet. > > > > Most of these look like numpy arrays but with custom dtypes and type > > specific methods/properties. But clearly nobody is particularly excited > > about writing the the C necessary to implement custom dtypes [3]. Nor is > do > > we need the ndarray ABI. > > > > In many cases, writing C may not actually even be necessary for > performance > > reasons, e.g., categorical can be fast enough just by wrapping an integer > > ndarray for the internal storage and using vectorized operations. And > even > > if it is necessary, I think we'd all rather write Cython than C. > > > > It's great for pandas to write its own ndarray-like wrappers (*not* > > subclasses) that work with pandas, but it's a shame that there isn't a > > standard interface like the ndarray to make these arrays useable for the > > rest of the scientific Python ecosystem. For example, pandas has loads of > > fixes for np.datetime64, but nobody seems to be up for porting them to > numpy > > (I doubt it would be easy). > > Writing them in the first place probably wasn't easy either :-). I > don't really know why pandas spends so much effort on reimplementing > stuff and papering over numpy limitations instead of fixing things > upstream so that everyone can benefit. I assume they have reasons, and > I could make some general guesses at what some of them might be, but > if you want to know what they are -- which is presumably the first > step in changing the situation -- you'll have to ask them, not us :-). > > > I know these sort of concerns are not new, but I wish I had a sense of > what > > the solution looks like. Is anyone actively working on these issues? Does > > the fix belong in numpy, pandas, blaze or a new project? I'd love to get > a > > sense of where things stand and how I could help -- without writing any C > > :). > > I think there are there are three parts: > > For stuff that's literally just fixing bugs in stuff that numpy > already has, then we'd certainly be happy to accept those bug fixes. > Probably there are things we can do to make this easier, I dunno. I'd > love to see some of numpy's internals moving into Cython to make them > easier to hack on, but this won't be simple because right now using > Cython to implement a module is really an all-or-nothing affair; > making it possible to mix Cython with numpy's existing C code will > require upstream changes in Cython. > For cases where people genuinely want to implement a new array-like > types (e.g. DataFrame or scipy.sparse) then numpy provides a fair > amount of support for this already (e.g., the various hooks that allow > things like np.asarray(mydf) or np.sin(mydf) to work), and we're > working on adding more over time (e.g., __numpy_ufunc__). > > My feeling though is that in most of the cases you mention, > implementing a new array-like type is huge overkill. ndarray's > interface is vast and reimplementing even 90% of it is a huge effort. 
> For most of the cases that people seem to run into in practice, the > solution is to enhance numpy's dtype interface so that it's possible > for mere mortals to implement new dtypes, e.g. by just subclassing > np.dtype. This is totally doable and would enable a ton of > awesomeness, but it requires someone with the time to sit down and > work on it, and no-one has volunteered yet. Unfortunately it does > require hacking on C code though. > While preparing my tutorial on NumPy C internals a year ago, I tried to get a "basic" dtype implemented in cython, and there were various issues even when doing all of it in cython (I can't remember the details now). Solving this would be a good first step. There were (are?) also some issues regarding type precedence in ufuncs involving a new dtype: numpy hardcodes that long double is the highest-precision floating point type, for example, and there were similar issues regarding datetime handling. That does not matter for completely new types that don't require interactions with others (categorical?). Would it help to prepare a set of "implement your own dtype" notebooks? I have a starting point from last year's tutorial (the corresponding slides were never shown for lack of time). David > > -- > Nathaniel J. Smith > Postdoctoral researcher - Informatics - University of Edinburgh > http://vorpus.org > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From toddrjen at gmail.com Tue Sep 23 05:31:46 2014 From: toddrjen at gmail.com (Todd) Date: Tue, 23 Sep 2014 11:31:46 +0200 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Mon, Sep 22, 2014 at 5:31 AM, Nathaniel Smith wrote: > On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer wrote: > My feeling though is that in most of the cases you mention, > implementing a new array-like type is huge overkill. ndarray's > interface is vast and reimplementing even 90% of it is a huge effort.
>> For most of the cases that people seem to run into in practice, the >> solution is to enhance numpy's dtype interface so that it's possible >> for mere mortals to implement new dtypes, e.g. by just subclassing >> np.dtype. This is totally doable and would enable a ton of >> awesomeness, but it requires someone with the time to sit down and >> work on it, and no-one has volunteered yet. Unfortunately it does >> require hacking on C code though. >> > > I'm unclear about the last sentence. Do you mean "improving the dtype > system will require hacking on C code" or "even if we improve the dtype > system dtypes will still have to be written in C"? > What ends up making this hard is that every place numpy does anything with a dtype needs to be at least audited and probably changed. All of that is in C right now, and most of it would likely remain C after the fact, simply because the rest of numpy is in C. Improving the dtype system requires working on C code. Eric -------------- next part -------------- An HTML attachment was scrubbed... URL: From travis at continuum.io Tue Sep 23 09:34:28 2014 From: travis at continuum.io (Travis Oliphant) Date: Tue, 23 Sep 2014 08:34:28 -0500 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Sun, Sep 21, 2014 at 6:50 PM, Stephan Hoyer wrote: > pandas has some hacks to support custom types of data for which numpy > can't handle well enough or at all. Examples include datetime and > Categorical [1], and others like GeoArray [2] that haven't make it into > pandas yet. > > Most of these look like numpy arrays but with custom dtypes and type > specific methods/properties. But clearly nobody is particularly excited > about writing the the C necessary to implement custom dtypes [3]. Nor is do > we need the ndarray ABI. > > In many cases, writing C may not actually even be necessary for > performance reasons, e.g., categorical can be fast enough just by wrapping > an integer ndarray for the internal storage and using vectorized > operations. And even if it is necessary, I think we'd all rather write > Cython than C. > > It's great for pandas to write its own ndarray-like wrappers (*not* > subclasses) that work with pandas, but it's a shame that there isn't a > standard interface like the ndarray to make these arrays useable for the > rest of the scientific Python ecosystem. For example, pandas has loads of > fixes for np.datetime64, but nobody seems to be up for porting them to > numpy (I doubt it would be easy). > > I know these sort of concerns are not new, but I wish I had a sense of > what the solution looks like. Is anyone actively working on these issues? > Does the fix belong in numpy, pandas, blaze or a new project? I'd love to > get a sense of where things stand and how I could help -- without writing > any C :). > > Hey Stephan, There are not easy answers to your questions. The reason is that NumPy's dtype system is not extensible enough with its fixed set of "builtin" data-types and its bolted-on "user-defined" datatypes. The implementation was adapted from the *descriptor* notion that was in Numeric (written almost 20 years ago). While a significant improvement over Numeric, the dtype system in NumPy still has several limitations: 1) it was not designed to add new fundamental data-types without breaking the ABI (most of the ABI breakage between 1.3 and 1.7, due to the addition of np.datetime, has been pushed to a small corner, but it is still there).
2) The user-defined data-type system which is present is not well tested and likely incomplete: it was the best I could come up with at the time NumPy first came out, with a bit of input from people like Fernando Perez and Francesc Alted. 3) It is far easier than in Numeric to add new data-types (that was a big part of the effort of NumPy), but it is still not as easy as one would like to add new data-types (either fundamental ones requiring recompilation of NumPy, or 'user-defined' data-types requiring C-code). I believe this system has served us well, but it needs to be replaced eventually. I think it can be replaced fairly seamlessly in a largely backward compatible way (though requiring re-compilation of dependencies). Fixing the dtype system is a fundamental effort behind several projects we are working on at Continuum: datashape, dynd, and numba. These projects are addressing fundamental limitations in a way that can lead to a significantly improved framework for scientific and tabular computing in Python. In the meantime, NumPy can continue to improve in small ways and in orthogonal ways (like the new __numpy_ufunc__ mechanism which allows ufuncs to work more seamlessly with different kinds of array-like objects). This kind of effort, as well as the improved buffer protocol in Python, means that multiple array-like objects can co-exist and use each other's data. Right now, I think that is the best current way to address the data-type limitations of NumPy. Another small project is possible today --- one could today use Numba or Cython to generate user-defined data-types for existing NumPy. That would be an interesting project and would certainly help to understand the limitations of the user-defined data-type framework without making people write C-code. You could use a meta-class and some code-generation techniques so that by defining a particular class you end up with a user-defined data-type for NumPy. Even while we have been addressing the fundamental limitations of NumPy with our new tools at Continuum, replacing NumPy is a big undertaking because of its large user-base. While I personally think that NumPy could be replaced for new users as early as next year with a combination of dynd and numba, the big install base of NumPy means that many people (including the company I work with, Continuum) will be supporting NumPy 1.X and Pandas and the rest of the NumPy-Stack for many years to come. So, even if you see me working and advocating new technology, that should never be construed as somehow ignoring or abandoning the current technology base. I remain deeply interested in the success of the scientific computing community --- even though I am not currently contributing a lot of code directly myself. As dynd and numba mature, I think it will be clear to more people how to proceed. For example, just recently the thought emerged that because dynd addresses some of the major needs that Pandas has, it may be possible very soon for dynd to replace NumPy as the foundational container for Pandas data-frames. Because Pandas' use of the NumPy API is limited, this is an easier undertaking than having dynd replace NumPy itself. And given that the new data-types of dynd (missing data, categorical types, variable-length strings, etc.) are some of the key areas that Pandas has work-arounds for, it may be a straightforward project. For those not aware: dynd is cython code that wraps the C++ library libdynd.
Currently libdynd is not complete, so working on dynd may require some improvements / fixes to libdynd. However, the dynd layer should be accessible to many people. The libdynd layer is also fairly straightforward C++. I strongly believe that the combination of libdynd and dynd is a much easier foundation to work on and maintain than the NumPy code base. I say this after having personally spent over a decade on the Numeric code-base and then the NumPy code base. The NumPy "C" code-base has been improved since I left it by the excellent work of several patient developers --- but it is not easy to transmit the knowledge necessary to understand the code-base sufficiently to maintain it without creating backward compatibility issues. So, while I continue to support the NumPy code base and its extensions (personally, through Numfocus, and through Continuum) and believe it will be relevant for many years, I also believe the future lies in renewing the NumPy code base with a combination of dynd and numba, with more emphasis on the high-level APIs like pandas and blaze. The good news is that this means: 1) a lot more code in Python or Cython, 2) compatibility with the PyPy world as part of a long-term effort to heal the rift that exists between scientific use of Python and "web use" of Python. In the end, all of this is good news for Python and scientific computing. More and better tools will continue to be written, with better interop between them. There are many places to jump in and help: dynd, libdynd, datashape, blaze, numba, scipy, scikits, numpy, pandas, and even a new project you create that enhances some aspect of any of these or does something like use Cython or Numba to create NumPy user-defined data-types from a Python class-specification. I agree it can be hard to know where things will eventually end up, and therefore where to spend your effort. All I can tell you is what I've decided and where I am pushing and promoting. Best, -Travis -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Tue Sep 23 09:55:16 2014 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 23 Sep 2014 09:55:16 -0400 Subject: [Numpy-discussion] Dataframe memory info printing Message-ID: For the 0.15.0 release of pandas (coming the 2nd week of Oct), we are going to include memory info printing: see here: https://github.com/pydata/pandas/pull/7619 This will be controllable by an option, display.memory_usage. My question to the community: should this be True by default, i.e. show the memory usage? (This only applies to df.info().) There is really no performance impact here. Pls let us know! thanks Jeff

>>> df.info(memory_usage=True)
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
date        datetime64[ns]
float       float64
int         int64
smallint    int16
string      object
dtypes: datetime64[ns](1), float64(1), int16(1), int64(1), object(1)
memory usage: 324.2 MB

-------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Tue Sep 23 10:00:16 2014 From: ben.root at ou.edu (Benjamin Root) Date: Tue, 23 Sep 2014 10:00:16 -0400 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: Travis, Thank you for your perspective on this issue. Such input is always valuable in helping us see where we came from and where we might go. My perspective on NumPy is fairly different, having come into Python right after the whole Numeric/NumArray transition to NumPy.
One of the things that really sold me on NumPy was not just how simple it was for me to use out of the box, but how easy it was to explicitly state that something needed to be of one type or another. The dtype notation was fairly simple and straight-forward. -- We should not underestimate the value of simple-to-write and simple-to-read notations in Python -- We can go ahead and put as many bells and whistles into the underlying infrastructure as you want, but if we can't design a simple notation language to utilize it, then it will never catch on. This isn't criticism of the work being done in dynd or the other projects; rather, it is a call for innovation. I don't know how I would design such a notation language, but we need that "ah-ha!" moment from *somebody*. I expressed this back at the NumPy BoF this summer. I would love an improved notation system that Matplotlib could take advantage of, one that would facilitate the plotting of more complicated graphs. But I am also not really interested in seeing NumPy turn into Pandas. Nothing wrong with Pandas; I just like the idea of modularity and I think it has suited the community well. Striking the right balance is going to be extremely important. Cheers! Ben Root On Tue, Sep 23, 2014 at 9:34 AM, Travis Oliphant wrote: > > On Sun, Sep 21, 2014 at 6:50 PM, Stephan Hoyer wrote: > >> pandas has some hacks to support custom types of data for which numpy >> can't handle well enough or at all. Examples include datetime and >> Categorical [1], and others like GeoArray [2] that haven't make it into >> pandas yet. >> >> Most of these look like numpy arrays but with custom dtypes and type >> specific methods/properties. But clearly nobody is particularly excited >> about writing the the C necessary to implement custom dtypes [3]. Nor is do >> we need the ndarray ABI. >> >> In many cases, writing C may not actually even be necessary for >> performance reasons, e.g., categorical can be fast enough just by wrapping >> an integer ndarray for the internal storage and using vectorized >> operations. And even if it is necessary, I think we'd all rather write >> Cython than C. >> >> It's great for pandas to write its own ndarray-like wrappers (*not* >> subclasses) that work with pandas, but it's a shame that there isn't a >> standard interface like the ndarray to make these arrays useable for the >> rest of the scientific Python ecosystem. For example, pandas has loads of >> fixes for np.datetime64, but nobody seems to be up for porting them to >> numpy (I doubt it would be easy). >> >> I know these sort of concerns are not new, but I wish I had a sense of >> what the solution looks like. Is anyone actively working on these issues? >> Does the fix belong in numpy, pandas, blaze or a new project? I'd love to >> get a sense of where things stand and how I could help -- without writing >> any C :). >> >> > Hey Stephan, > > There are not easy answers to your questions. The reason is that NumPy's > dtype system is not extensible enough with its fixed set of "builtin" > data-types and its bolted-on "user-defined" datatypes. The implementation > was adapted from the *descriptor* notion that was in Numeric (written > almost 20 years ago).
While a significant improvement over Numeric, the > dtype system in NumPy still has several limitations: > > 1) it was not designed to add new fundamental data-types without > breaking the ABI (most of the ABI breakage between 1.3 and 1.7 due to the > addition of np.datetime has been pushed to a small corner but it is still > there). > > 2) The user-defined data-type system which is present is not well > tested and likely incomplete: it was the best I could come up with at the > time NumPy first came out with a bit of input from people like Fernando > Perez and Francesc Alted. > > 3) It is far easier than in Numeric to add new data-types (that was a > big part of the effort of NumPy), but it is still not as easy as one would > like to add new data-types (either fundamental ones requiring recompilation > of NumPy or 'user-defined' data-types requiring C-code. > > I believe this system has served us well, but it needs to be replaced > eventually. I think it can be replaced fairly seamlessly in a largely > backward compatible way (though requiring re-compilation of dependencies). > Fixing the dtype system is a fundamental effort behind several projects > we are working on at Continuum: datashape, dynd, and numba. These > projects are addressing fundamental limitations in a way that can lead to a > significantly improved framework for scientific and tabular computing in > Python. > > In the mean-time, NumPy can continue to improve in small ways and in > orthogonal ways (like the new __numpy_ufunc__ mechanism which allows ufuncs > to work more seamlessly with different kinds of array-like objects). > This kind of effort as well as the improved buffer protocol in Python, > mean that multiple array-like objects can co-exist and use each-other's > data. Right now, I think that is the best current way to address the > data-type limitations of NumPy. > > Another small project is possible today --- one could today use Numba or > Cython to generate user-defined data-types for existing NumPy. That would > be an interesting project and would certainly help to understand the > limitations of the user-defined data-type framework without making people > write C-code. You could use a meta-class and some code-generation > techniques so that by defining a particular class you end-up with a > user-defined data-type for NumPy. > > Even while we have been addressing the fundamental limitations of NumPy > with our new tools at Continuum, replacing NumPy is a big undertaking > because of its large user-base. While I personally think that NumPy could > be replaced for new users as early as next year with a combination of dynd > and numba, the big install base of NumPy means that many people (including > the company I work with, Continuum) will be supporting NumPy 1.X and Pandas > and the rest of the NumPy-Stack for many years to come. > > So, even if you see me working and advocating new technology, that should > never be construed as somehow ignoring or abandoning the current technology > base. I remain deeply interested in the success of the scientific > computing community --- even though I am not currently contributing a lot > of code directly myself. As dynd and numba mature, I think it will be > clear to more people how to proceed. > > For example, just recently the thought emerged that because dynd addresses > some of the major needs that Pandas has, it may be possible very soon for > dynd to replace NumPy as the foundational container for Pandas data-frames. 
> Because Pandas use of the NumPy API is limited, this is an easier > undertaking than having dynd replace NumPy itself. And given that the new > data-types of dynd: missing-data, categorical types, variable-length > strings, etc. are some of the key areas that Pandas has work-arounds for, > it may be a straight-forward project. > > For those not aware: dynd is cython code that wraps the C++ library > libdynd. Currently libdynd is not complete, so working on dynd may > require some improvements / fixes to libdynd. However, the dynd layer > should be accessible to many people. The libdynd layer is also fairly > straightforward C++. I strongly believe that the combination of libdynd > and dynd is a much easier foundation to work on and maintain than the NumPy > code base. I say this after having personally spent over a decade on > the Numeric code-base and then the NumPy code base. The NumPy "C" > code-base has been improved since I left it by the excellent work of > several patient developers --- but it is not easy to transmit the knowledge > necessary to understand the code-base sufficient to maintain it without > creating backward compatibility issues. > > So, while I continue to support the NumPy code base and its extensions > (personally, through Numfocus, and through Continuum) and believe it will > be relevant for many years, I also believe the future lies in renewing the > NumPy code base with a combination of dynd and numba with more emphasis on > the high-level APIs like pandas and blaze. The good news is that this > means: 1) a lot more code in Python or Cython, 2) compatibility with the > PyPy world as part of a long term effort to heal the rift that exists > between scientific-use of Python and "web-use" of Python. > > In the end, all of this is good news for Python and scientific computing. > More and better tools will continue to be written with better interop > between them. There are many places to jump in and help: dynd, libdynd, > datashape, blaze, numba, scipy, scikits, numpy, pandas, and even a new > project you create that enhances some aspect of any of these or does > something like use Cython or Numba to create NumPy user-defined data-types > from a Python class-specification. > > I agree it can be hard to know where things will eventually end up and so > therefore where to spend your effort. All I can tell you is what I've > decided and where I am pushing and promoting. > > Best, > > -Travis > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Sep 23 18:59:46 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 23 Sep 2014 16:59:46 -0600 Subject: [Numpy-discussion] Changes to the generalized functions. Message-ID: Hi All, The question has come up as to whether or not to treat the new gufunc behavior as a bug fix, keeping the old constructor name, or to have a different constructor. Keeping the name makes life easier as we don't need to edit the code where numpy currently uses gufuncs, but is risky if some third party depends on the old behavior. The gufuncs have been part of numpy since the 1.3 release, and google doesn't turn up any uses that I can see apart from repeats of numpy code. We can also make fixes if needed during the 1.10 beta release cycle. Even so, it is a bit of a risk.
To spread the blame, if any, please weigh in on the following. 1. Yes, it is a bug, keep the current name and fix the behavior. 2. No, we need to be conservative and use a new function. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Sep 23 19:31:30 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 23 Sep 2014 17:31:30 -0600 Subject: [Numpy-discussion] Changes to the generalized functions. In-Reply-To: References: Message-ID: On Tue, Sep 23, 2014 at 4:59 PM, Charles R Harris wrote: > Hi All, > > The question has come up as to whether or not to treat the new gufunc > behavior as a bug fix, keeping the old constructor name, or to have a > different constructor. Keeping the name makes life easier as we don't need > to edit the code where numpy currently uses gufuncs, but is risky if some > third party depends on the old behavior. The gufuncs have been part of > numpy since the 1.3 release, and google doesn't turn up any uses that I can > see apart from repeats of numpy code. We can also make fixes if needed > during the 1.10 beta release cycle. Even so, it is a bit of a risk. To > spread the blame, if any, please weigh in on the following. > > > 1. Yes, it is a bug, keep the current name and fix the behavior. > 2. No, we need to be conservative and use a new function. > > To clarify the changes: currently, if an input array does not have as many dimensions as indicated by the signature, it is filled out with ones to match the signature and broadcast against the other array. It is also the case that if a dimension in the signature is 1, then it is broadcast against the corresponding dimension of the other argument. That is, the parts identified in the signature are treated as little arrays. The proposed change is to require that the inputs have at least as many dimensions as the signature, and that core dimensions of 1 are not broadcast. That is, the parts identified in the signature are treated as vectors or matrices rather than little arrays. A small example of the difference follows below. Chuck -------------- next part -------------- An HTML attachment was scrubbed...
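A demonstration using the test gufunc that ships with numpy (numpy.core.umath_tests here; the module's location and the exact failure mode vary by version, so treat this as a sketch):

import numpy as np
from numpy.core.umath_tests import matrix_multiply  # signature: (m,n),(n,p)->(m,p)

a = np.ones((5, 3, 2))   # a stack of five (3, 2) matrices
b = np.ones((2, 4))      # a single (2, 4) matrix, broadcast across the stack
matrix_multiply(a, b).shape   # -> (5, 3, 4); this part is unchanged

# Under the current rules, a 1-d input is padded with a 1 to fit the core
# signature, and a core dimension of 1 broadcasts against the other operand.
# Under the proposed rules, both of these are errors instead:
matrix_multiply(np.ones(2), b)   # current: shape (1, 4); proposed: raises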
URL: From huruomu at gmail.com Tue Sep 23 23:21:15 2014 From: huruomu at gmail.com (Romu Hu) Date: Wed, 24 Sep 2014 11:21:15 +0800 Subject: [Numpy-discussion] All numpy-f2py tests failed Message-ID: <542238AB.2090408@gmail.com> Hi, I'm using python27-numpy-f2py-1.7.1-9.el6.x86_64 from RHEL6, the package has a test directory "/usr/lib64/python2.7/site-packages/numpy/f2py/tests", when I run unittest in the directory, all 358 testcases fail: # cd /usr/lib64/python2.7/site-packages/numpy/f2py/tests # python27 -m unittest discover -v ====================================================================== ERROR: test_c_copy_in_from_23casttype (test_array_from_pyobj.test_BOOL_gen) ---------------------------------------------------------------------- Traceback (most recent call last): File "", line 5, in setUp File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 126, in __new__ obj._init(name) File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 132, in _init self.type_num = getattr(wrap,'NPY_'+self.NAME) AttributeError: 'NoneType' object has no attribute 'NPY_BOOL' ====================================================================== ERROR: test_c_in_from_23casttype (test_array_from_pyobj.test_BOOL_gen) ---------------------------------------------------------------------- Traceback (most recent call last): File "", line 5, in setUp File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 126, in __new__ obj._init(name) File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 132, in _init self.type_num = getattr(wrap,'NPY_'+self.NAME) AttributeError: 'NoneType' object has no attribute 'NPY_BOOL' ...... ...... 
====================================================================== ERROR: test_optional_from_23seq (test_array_from_pyobj.test_USHORT_gen) ---------------------------------------------------------------------- Traceback (most recent call last): File "", line 5, in setUp File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 126, in __new__ obj._init(name) File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 132, in _init self.type_num = getattr(wrap,'NPY_'+self.NAME) AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' ====================================================================== ERROR: test_optional_from_2seq (test_array_from_pyobj.test_USHORT_gen) ---------------------------------------------------------------------- Traceback (most recent call last): File "", line 5, in setUp File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 126, in __new__ obj._init(name) File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 132, in _init self.type_num = getattr(wrap,'NPY_'+self.NAME) AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' ====================================================================== ERROR: test_optional_none (test_array_from_pyobj.test_USHORT_gen) ---------------------------------------------------------------------- Traceback (most recent call last): File "", line 5, in setUp File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 126, in __new__ obj._init(name) File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 132, in _init self.type_num = getattr(wrap,'NPY_'+self.NAME) AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' ====================================================================== ERROR: test_in_out (test_array_from_pyobj.test_intent) ---------------------------------------------------------------------- Traceback (most recent call last): File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 270, in test_in_out assert_equal(str(intent.in_.out),'intent(in,out)') File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 67, in __getattr__ return self.__class__(self.intent_list+[name]) File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", line 62, in __init__ flags |= getattr(wrap,'F2PY_INTENT_'+i.upper()) AttributeError: 'NoneType' object has no attribute 'F2PY_INTENT_IN' ---------------------------------------------------------------------- Ran 358 tests in 0.047s FAILED (errors=358) It seems that all tests fail because the 'wrap' variable is None. Am I using a wrong way to run the tests? Any idea? 
Thanks Romu From charlesr.harris at gmail.com Wed Sep 24 00:20:54 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 23 Sep 2014 22:20:54 -0600 Subject: [Numpy-discussion] All numpy-f2py tests failed In-Reply-To: <542238AB.2090408@gmail.com> References: <542238AB.2090408@gmail.com> Message-ID: On Tue, Sep 23, 2014 at 9:21 PM, Romu Hu wrote: > Hi, > > I'm using python27-numpy-f2py-1.7.1-9.el6.x86_64 from RHEL6, the package > has a test directory > "/usr/lib64/python2.7/site-packages/numpy/f2py/tests", when I run > unittest in the directory, all 358 testcases fail: > > # cd /usr/lib64/python2.7/site-packages/numpy/f2py/tests > # python27 -m unittest discover -v > > ====================================================================== > ERROR: test_c_copy_in_from_23casttype (test_array_from_pyobj.test_BOOL_gen) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "", line 5, in setUp > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 126, in __new__ > obj._init(name) > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 132, in _init > self.type_num = getattr(wrap,'NPY_'+self.NAME) > AttributeError: 'NoneType' object has no attribute 'NPY_BOOL' > > ====================================================================== > ERROR: test_c_in_from_23casttype (test_array_from_pyobj.test_BOOL_gen) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "", line 5, in setUp > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 126, in __new__ > obj._init(name) > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 132, in _init > self.type_num = getattr(wrap,'NPY_'+self.NAME) > AttributeError: 'NoneType' object has no attribute 'NPY_BOOL' > > ...... > ...... 
> > ====================================================================== > ERROR: test_optional_from_23seq (test_array_from_pyobj.test_USHORT_gen) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "", line 5, in setUp > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 126, in __new__ > obj._init(name) > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 132, in _init > self.type_num = getattr(wrap,'NPY_'+self.NAME) > AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' > > ====================================================================== > ERROR: test_optional_from_2seq (test_array_from_pyobj.test_USHORT_gen) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "", line 5, in setUp > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 126, in __new__ > obj._init(name) > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 132, in _init > self.type_num = getattr(wrap,'NPY_'+self.NAME) > AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' > > ====================================================================== > ERROR: test_optional_none (test_array_from_pyobj.test_USHORT_gen) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "", line 5, in setUp > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 126, in __new__ > obj._init(name) > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 132, in _init > self.type_num = getattr(wrap,'NPY_'+self.NAME) > AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' > > ====================================================================== > ERROR: test_in_out (test_array_from_pyobj.test_intent) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 270, in test_in_out > assert_equal(str(intent.in_.out),'intent(in,out)') > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 67, in __getattr__ > return self.__class__(self.intent_list+[name]) > File > > "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", > line 62, in __init__ > flags |= getattr(wrap,'F2PY_INTENT_'+i.upper()) > AttributeError: 'NoneType' object has no attribute 'F2PY_INTENT_IN' > > ---------------------------------------------------------------------- > Ran 358 tests in 0.047s > > FAILED (errors=358) > > > It seems that all tests fail because the 'wrap' variable is None. Am I > using a wrong way to run the tests? Any idea? > > Looks pretty drastic, but I expect the problem is that numpy uses nose for running tests, not unittest. Try charris at localhost [tests (master)]$ nosetests ---------------------------------------------------------------------- Ran 379 tests in 12.457s OK Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Wed Sep 24 00:30:55 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 23 Sep 2014 22:30:55 -0600 Subject: [Numpy-discussion] All numpy-f2py tests failed In-Reply-To: References: <542238AB.2090408@gmail.com> Message-ID: On Tue, Sep 23, 2014 at 10:20 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > > On Tue, Sep 23, 2014 at 9:21 PM, Romu Hu wrote: > >> Hi, >> >> I'm using python27-numpy-f2py-1.7.1-9.el6.x86_64 from RHEL6, the package >> has a test directory >> "/usr/lib64/python2.7/site-packages/numpy/f2py/tests", when I run >> unittest in the directory, all 358 testcases fail: >> >> # cd /usr/lib64/python2.7/site-packages/numpy/f2py/tests >> # python27 -m unittest discover -v >> >> ====================================================================== >> ERROR: test_c_copy_in_from_23casttype >> (test_array_from_pyobj.test_BOOL_gen) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "", line 5, in setUp >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 126, in __new__ >> obj._init(name) >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 132, in _init >> self.type_num = getattr(wrap,'NPY_'+self.NAME) >> AttributeError: 'NoneType' object has no attribute 'NPY_BOOL' >> >> ====================================================================== >> ERROR: test_c_in_from_23casttype (test_array_from_pyobj.test_BOOL_gen) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "", line 5, in setUp >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 126, in __new__ >> obj._init(name) >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 132, in _init >> self.type_num = getattr(wrap,'NPY_'+self.NAME) >> AttributeError: 'NoneType' object has no attribute 'NPY_BOOL' >> >> ...... >> ...... 
>> >> ====================================================================== >> ERROR: test_optional_from_23seq (test_array_from_pyobj.test_USHORT_gen) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "", line 5, in setUp >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 126, in __new__ >> obj._init(name) >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 132, in _init >> self.type_num = getattr(wrap,'NPY_'+self.NAME) >> AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' >> >> ====================================================================== >> ERROR: test_optional_from_2seq (test_array_from_pyobj.test_USHORT_gen) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "", line 5, in setUp >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 126, in __new__ >> obj._init(name) >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 132, in _init >> self.type_num = getattr(wrap,'NPY_'+self.NAME) >> AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' >> >> ====================================================================== >> ERROR: test_optional_none (test_array_from_pyobj.test_USHORT_gen) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "", line 5, in setUp >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 126, in __new__ >> obj._init(name) >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 132, in _init >> self.type_num = getattr(wrap,'NPY_'+self.NAME) >> AttributeError: 'NoneType' object has no attribute 'NPY_USHORT' >> >> ====================================================================== >> ERROR: test_in_out (test_array_from_pyobj.test_intent) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 270, in test_in_out >> assert_equal(str(intent.in_.out),'intent(in,out)') >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 67, in __getattr__ >> return self.__class__(self.intent_list+[name]) >> File >> >> "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py", >> line 62, in __init__ >> flags |= getattr(wrap,'F2PY_INTENT_'+i.upper()) >> AttributeError: 'NoneType' object has no attribute 'F2PY_INTENT_IN' >> >> ---------------------------------------------------------------------- >> Ran 358 tests in 0.047s >> >> FAILED (errors=358) >> >> >> It seems that all tests fail because the 'wrap' variable is None. Am I >> using a wrong way to run the tests? Any idea? >> >> > Looks pretty drastic, but I expect the problem is that numpy uses nose for > running tests, not unittest. 
Try > > charris at localhost [tests (master)]$ nosetests > > > > ---------------------------------------------------------------------- > Ran 379 tests in 12.457s > > OK > > But this doesn't work with the installed package. From ipython, or a terminal, do In [8]: np.f2py.test() Running unit tests for numpy.f2py NumPy version 1.10.0.dev-4083883 NumPy is installed in /home/charris/.local/lib/python2.7/site-packages/numpy Python version 2.7.5 (default, Jun 25 2014, 10:19:55) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] nose version 1.3.0 ....................................................................................................................................................................................................................................................................................................................................................................... ---------------------------------------------------------------------- Ran 359 tests in 1.737s OK Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Sep 24 00:52:25 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 23 Sep 2014 22:52:25 -0600 Subject: [Numpy-discussion] All numpy-f2py tests failed In-Reply-To: References: <542238AB.2090408@gmail.com> Message-ID: On Tue, Sep 23, 2014 at 10:30 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > > On Tue, Sep 23, 2014 at 10:20 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> Looks like you need to do >>> import numpy.f2py as f2py >>> f2py.test() To run tests with the redhat package. Redhat splits it up, so things are a bit weird. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From huruomu at gmail.com Wed Sep 24 01:48:14 2014 From: huruomu at gmail.com (Romu Hu) Date: Wed, 24 Sep 2014 13:48:14 +0800 Subject: [Numpy-discussion] All numpy-f2py tests failed In-Reply-To: References: <542238AB.2090408@gmail.com> Message-ID: <54225B1E.1030202@gmail.com> On 2014/9/24 12:52, Charles R Harris wrote: > On Tue, Sep 23, 2014 at 10:30 PM, Charles R Harris > > wrote: > > On Tue, Sep 23, 2014 at 10:20 PM, Charles R Harris > > wrote: > > > > > > > Looks like you need to do > > >>> import numpy.f2py as f2py > >>> f2py.test() I tried it but no test was run: Run python in the python27 env: $ scl enable python27 "python" Python 2.7.5 (default, Aug 13 2014, 02:49:07) [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> Run the test: >>> import numpy.f2py as f2py >>> f2py.test() Running unit tests for numpy.f2py NumPy version 1.7.1 NumPy is installed in /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy Python version 2.7.5 (default, Aug 13 2014, 02:49:07) [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] nose version 1.3.0 S ---------------------------------------------------------------------- Ran 0 tests in 0.452s OK (SKIP=1) Files of the python27-numpy-f2py-1.7.1-9.el6.x86_64 package: /opt/rh/python27/root/usr/bin/f2py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/__init__.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/__init__.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/__init__.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/__version__.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/__version__.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/__version__.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/auxfuncs.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/auxfuncs.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/auxfuncs.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/capi_maps.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/capi_maps.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/capi_maps.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/cb_rules.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/cb_rules.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/cb_rules.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/cfuncs.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/cfuncs.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/cfuncs.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/common_rules.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/common_rules.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/common_rules.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/crackfortran.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/crackfortran.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/crackfortran.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/diagnose.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/diagnose.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/diagnose.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f2py2e.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f2py2e.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f2py2e.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f2py_testing.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f2py_testing.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f2py_testing.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f90mod_rules.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f90mod_rules.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/f90mod_rules.pyo 
/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/func2subr.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/func2subr.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/func2subr.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/info.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/info.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/info.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/rules.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/rules.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/rules.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/setup.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/setup.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/setup.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/setupscons.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/setupscons.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/setupscons.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/src /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/src/fortranobject.c /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/src/fortranobject.h /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/array_from_pyobj /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/array_from_pyobj/wrapmodule.c /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/assumed_shape /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/assumed_shape/.f2py_f2cmap /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/assumed_shape/foo_free.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/assumed_shape/foo_mod.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/assumed_shape/foo_use.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/assumed_shape/precision.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/kind /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/kind/foo.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/mixed /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/mixed/foo.f /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/mixed/foo_fixed.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/mixed/foo_free.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/size /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/src/size/foo.f90 /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_array_from_pyobj.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_assumed_shape.py 
/opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_assumed_shape.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_assumed_shape.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_callback.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_callback.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_callback.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_kind.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_kind.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_kind.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_mixed.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_mixed.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_mixed.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_character.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_character.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_character.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_complex.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_complex.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_complex.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_integer.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_integer.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_integer.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_logical.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_logical.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_logical.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_real.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_real.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_return_real.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_size.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_size.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/test_size.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/util.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/util.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/tests/util.pyo /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/use_rules.py /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/use_rules.pyc /opt/rh/python27/root/usr/lib64/python2.7/site-packages/numpy/f2py/use_rules.pyo /opt/rh/python27/root/usr/share/man/man1/f2py.1.gz Thanks Romu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saullogiovani at gmail.com Wed Sep 24 12:36:05 2014 From: saullogiovani at gmail.com (Saullo Castro) Date: Wed, 24 Sep 2014 18:36:05 +0200 Subject: [Numpy-discussion] PR #5109 - interpolation function for polar coordinates Message-ID: Dear all, today I've submitted a pull request: https://github.com/numpy/numpy/pull/5109 in order to include an interpolation function for angular coordinates, since using np.interp for this purpose is cumbersome. I kindly ask for your feedback and/or questions. Greetings, Saullo -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Sep 24 13:08:29 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 24 Sep 2014 10:08:29 -0700 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: On Tue, Sep 23, 2014 at 4:40 AM, Eric Moore wrote: > Improving the dtype system requires working on c code. > yes -- it sure does. But I think that is a bit of a Red Herring. I'm barely competent in C, and don't like it much, but the real barrier to entry for me is not that it's in C, but that it's really complex and hard to hack on, as it wasn't designed to support custom dtypes, etc. from the start. There is a lot of ugly code in there that has been hacked in to support various functionality over time. If there was a clean dtype-extension system in C, then A) it wouldn't be bad C to write, and B) would be pretty easy to make a Cython-wrapped version. Travis gave a nice vision for the future, but in the meantime, I'm wondering: Could we hack in a generic "custom dtype" dtype object into the current system that would delegate everything to the dtype object -- in a truly object-oriented way. I'm imagining that this custom dtype object would be a pyObject and thus very hackable, easy to make a new subclass, etc -- essentially like making a new class in python that emulates one of the built-in type interfaces. This would be slow as a dog -- if inside that C loop, numpy would have to call out to python to do anyting, maybe as simple as arithmetic, but it would be clean, extensible system, and a good way for folks to plug in and try out new dtypes when performance didn't matter, or as prototypes for something that would get plugged in at the C level later once the API was worked out. Is this even possible without too much hacking to the current dtype system? Would it be as simple as adding a bit to the object dtype? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilshire461 at gmail.com Wed Sep 24 13:23:27 2014 From: wilshire461 at gmail.com (John) Date: Wed, 24 Sep 2014 12:23:27 -0500 Subject: [Numpy-discussion] Building numpy with OpenMP support Message-ID: I am in the process of trying to build numpy with OpenMP support but have had several issues. Has anyone else built it with success that could offer some guidance in what needs to be passed at build time. 
For reference I am using Atlas 3.10.2 built with OpenMP as well (-F alg -fopenmp) Thanks, JB From tjhnson at gmail.com Wed Sep 24 13:23:46 2014 From: tjhnson at gmail.com (T J) Date: Wed, 24 Sep 2014 12:23:46 -0500 Subject: [Numpy-discussion] Round away from zero (towards +/- infinity) Message-ID: Is there a ufunc for rounding away from zero? Or do I need to do x2 = sign(x) * ceil(abs(x)) whenever I want to round away from zero? Maybe the following is better? x_ceil = ceil(x) x_floor = floor(x) x2 = where(x >= 0, x_ceil, x_floor) Python's round function goes away from zero, so I am looking for the NumPy equivalent (and using vectorize() seems undesirable). In this sense, it seems that having a ufunc for this type of rounding could be helpful. Aside: Is there interest in a more general around() that allows users to specify alternative tie-breaking rules, with the default staying 'round half to nearest even'? [1] Also, what is the difference between NumPy's fix() and trunc() functions? It seems like they achieve the same goal. trunc() was added in 1.3.0. So is fix() just legacy? --- [1] http://stackoverflow.com/questions/16000574/tie-breaking-of-round-with-numpy -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaime.frio at gmail.com Wed Sep 24 14:52:11 2014 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Wed, 24 Sep 2014 11:52:11 -0700 Subject: [Numpy-discussion] Add `nrows` to `genfromtxt` Message-ID: There is a PR in github that adds a new keyword to the genfromtxt function, to limit the number of rows that actually get read in: https://github.com/numpy/numpy/pull/5103 It is mostly ready to go, and several devs have looked at it without complaining. Since it is an API change, I wanted to check here: if no one has any strong opposition, I will be merging it sometime tomorrow. You have been warned... ;-) Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From alan.isaac at gmail.com Wed Sep 24 15:20:15 2014 From: alan.isaac at gmail.com (Alan G Isaac) Date: Wed, 24 Sep 2014 15:20:15 -0400 Subject: [Numpy-discussion] Add `nrows` to `genfromtxt` In-Reply-To: References: Message-ID: <5423196F.4000507@gmail.com> On 9/24/2014 2:52 PM, Jaime Fern?ndez del R?o wrote: > There is a PR in github that adds a new keyword to the genfromtxt function, to limit the number of rows that actually get read in: > https://github.com/numpy/numpy/pull/5103 Sorry to come late to this party, but it seems to me that more versatile than an `nrows` keyword for the number of rows would be a "rows" keyword for a slice argument. fwiw, Alan Isaac From jaime.frio at gmail.com Wed Sep 24 16:54:38 2014 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Wed, 24 Sep 2014 13:54:38 -0700 Subject: [Numpy-discussion] Changes to the generalized functions. In-Reply-To: References: Message-ID: On Tue, Sep 23, 2014 at 4:31 PM, Charles R Harris wrote: > > > On Tue, Sep 23, 2014 at 4:59 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> Hi All, >> >> The question has come up as the whether of not to treat the new gufunc >> behavior as a bug fix, keeping the old constructor name, or have a >> different constructor. 
Keeping the name makes life easier as we don't need >> to edit the code where numpy currently uses gufuncs, but is risky if some >> third party depends on the old behavior. The gufuncs have been part of >> numpy since the 1.3 release, and google doesn't turn up any uses that I can >> see apart from repeats of numpy code. We can also make fixes if needed >> during the 1.10 beta release cycle. Even so, it is a bit of a risk. To >> spread the blame, if any, please weigh in on the following. >> >> >> 1. Yes, it is a bug, keep the current name and fix the behavior. >> 2. No, we need to be conservative and use a new function. >> >> To clarify the changes. Currently, if an input array does not have as > many dimensions as indicated by the signature, it is filled out with ones > to match the signature as well as broadcasting to the other array. It is > also the case that if a dimension in the signature is 1, then it is > broadcast with the data in the other signature. That is, the parts > identified in the signature are treated as little arrays. The proposed > change is to require that the inputs have at least as many dimensions as > the signature and that dimensions of 1 are not broadcast. That is, the > parts identified in the signature are treated as vectors or matrices rather > than little arrays. > It looks like we get to keep all the blame to ourselves... ;-) After sleeping on it, I am leaning more and more towards Nathaniel's proposal. Make it a bug, but leave it ready to be a warning if needed, preferably triggered by something like #define STRICT_SIGNATURE and a bunch of #ifdef's scattered through the code. It shouldn't be much more work than what already is in place. We would merge another commit right before the beta cycle removing the unused code, and if anyone complains revert that commit, remove the #define, and keep it for one release before deprecating it. Testing for both possibilities is going to be a little trickier to get right, but should be doable. Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Sep 24 17:30:37 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 24 Sep 2014 15:30:37 -0600 Subject: [Numpy-discussion] Changes to the generalized functions. In-Reply-To: References: Message-ID: On Wed, Sep 24, 2014 at 2:54 PM, Jaime Fern?ndez del R?o < jaime.frio at gmail.com> wrote: > On Tue, Sep 23, 2014 at 4:31 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Tue, Sep 23, 2014 at 4:59 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> Hi All, >>> >>> The question has come up as the whether of not to treat the new gufunc >>> behavior as a bug fix, keeping the old constructor name, or have a >>> different constructor. Keeping the name makes life easier as we don't need >>> to edit the code where numpy currently uses gufuncs, but is risky if some >>> third party depends on the old behavior. The gufuncs have been part of >>> numpy since the 1.3 release, and google doesn't turn up any uses that I can >>> see apart from repeats of numpy code. We can also make fixes if needed >>> during the 1.10 beta release cycle. Even so, it is a bit of a risk. To >>> spread the blame, if any, please weigh in on the following. >>> >>> >>> 1. Yes, it is a bug, keep the current name and fix the behavior. >>> 2. 
No, we need to be conservative and use a new function. >>> >>> To clarify the changes. Currently, if an input array does not have as >> many dimensions as indicated by the signature, it is filled out with ones >> to match the signature as well as broadcasting to the other array. It is >> also the case that if a dimension in the signature is 1, then it is >> broadcast with the data in the other signature. That is, the parts >> identified in the signature are treated as little arrays. The proposed >> change is to require that the inputs have at least as many dimensions as >> the signature and that dimensions of 1 are not broadcast. That is, the >> parts identified in the signature are treated as vectors or matrices rather >> than little arrays. >> > > It looks like we get to keep all the blame to ourselves... ;-) > > After sleeping on it, I am leaning more and more towards Nathaniel's > proposal. Make it a bug, but leave it ready to be a warning if needed, > preferably triggered by something like #define STRICT_SIGNATURE and a bunch > of #ifdef's scattered through the code. It shouldn't be much more work than > what already is in place. We would merge another commit right before the > beta cycle removing the unused code, and if anyone complains revert that > commit, remove the #define, and keep it for one release before deprecating > it. > > Testing for both possibilities is going to be a little trickier to get > right, but should be doable. > I've been headed the other way, but using a simplified name for the new function: PyUFunc_FromFuncAndDataAndSignature2, which is along the lines of PyArray_MatrixProduct and PyArray_MatrixProduct2 in the numpy API. I do like Nathaniel's suggestion, but I also like taking chances and need to fight the urge ;) Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From saullogiovani at gmail.com Wed Sep 24 17:57:38 2014 From: saullogiovani at gmail.com (Saullo Castro) Date: Wed, 24 Sep 2014 23:57:38 +0200 Subject: [Numpy-discussion] Interpolation using `np.interp()` with periodic x-coordinates Message-ID: >From the closed pull request PR #5109: https://github.com/numpy/numpy/pull/5109 it came out that the a good implementation would be adding a parameter `period`. I would like to know about the community's interest for this implementation. The modification are shown here: https://github.com/saullocastro/numpy/compare/interp_with_period?expand=1 Please, let me know about your feedback. Regards, Saullo -------------- next part -------------- An HTML attachment was scrubbed... URL: From travis at continuum.io Wed Sep 24 18:11:05 2014 From: travis at continuum.io (Travis Oliphant) Date: Wed, 24 Sep 2014 17:11:05 -0500 Subject: [Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type In-Reply-To: References: Message-ID: This could actually be done by using the structured dtype pretty easily. The hard work would be improving the ufunc and generalized ufunc mechanism to handle structured data-types. Numba actually provides some of this already, so if you have NumPy + Numba you can do this sort of thing now. -Travis On Wed, Sep 24, 2014 at 12:08 PM, Chris Barker wrote: > On Tue, Sep 23, 2014 at 4:40 AM, Eric Moore > wrote: > >> Improving the dtype system requires working on c code. >> > > yes -- it sure does. But I think that is a bit of a Red Herring. 
I'm > barely competent in C, and don't like it much, but the real barrier to > entry for me is not that it's in C, but that it's really complex and hard > to hack on, as it wasn't designed to support custom dtypes, etc. from the > start. There is a lot of ugly code in there that has been hacked in to > support various functionality over time. If there was a clean > dtype-extension system in C, then A) it wouldn't be bad C to write, and B) > would be pretty easy to make a Cython-wrapped version. > > Travis gave a nice vision for the future, but in the meantime, I'm > wondering: > > Could we hack in a generic "custom dtype" dtype object into the current > system that would delegate everything to the dtype object -- in a truly > object-oriented way. I'm imagining that this custom dtype object would be a > pyObject and thus very hackable, easy to make a new subclass, etc -- > essentially like making a new class in python that emulates one of the > built-in type interfaces. > > This would be slow as a dog -- if inside that C loop, numpy would have to > call out to python to do anyting, maybe as simple as arithmetic, but it > would be clean, extensible system, and a good way for folks to plug in and > try out new dtypes when performance didn't matter, or as prototypes for > something that would get plugged in at the C level later once the API was > worked out. > > Is this even possible without too much hacking to the current dtype > system? Would it be as simple as adding a bit to the object dtype? > > -Chris > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Travis Oliphant CEO Continuum Analytics, Inc. http://www.continuum.io -------------- next part -------------- An HTML attachment was scrubbed... URL: From aron at ahmadia.net Thu Sep 25 10:36:57 2014 From: aron at ahmadia.net (Aron Ahmadia) Date: Thu, 25 Sep 2014 10:36:57 -0400 Subject: [Numpy-discussion] Fwd: Sport Rock Friday? In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: *Brian Joseph* Date: Thursday, September 25, 2014 Subject: Sport Rock Friday? To: Aron Ahmadia Hey Aron want to hit up Sport Rock tomorrow night? Meg and I would like to go. Meg has to work until 8, so how does 8:30 sound? -------------- next part -------------- An HTML attachment was scrubbed... URL: From aron at ahmadia.net Thu Sep 25 10:37:58 2014 From: aron at ahmadia.net (Aron Ahmadia) Date: Thu, 25 Sep 2014 10:37:58 -0400 Subject: [Numpy-discussion] Sport Rock Friday? In-Reply-To: References: Message-ID: Sorry for the SPAM folks. Itchy smartphone :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From cimrman3 at ntc.zcu.cz Thu Sep 25 11:21:38 2014 From: cimrman3 at ntc.zcu.cz (Robert Cimrman) Date: Thu, 25 Sep 2014 17:21:38 +0200 Subject: [Numpy-discussion] ANN: SfePy 2014.3 Message-ID: <54243302.1010701@ntc.zcu.cz> I am pleased to announce release 2014.3 of SfePy. Description ----------- SfePy (simple finite elements in Python) is a software for solving systems of coupled partial differential equations by the finite element method or by the isogeometric analysis (preliminary support). 
It is distributed under the new BSD license. Home page: http://sfepy.org Mailing list: http://groups.google.com/group/sfepy-devel Git (source) repository, issue tracker, wiki: http://github.com/sfepy Highlights of this release -------------------------- - isogeometric analysis (IGA) speed-up by C implementation of NURBS basis evaluation - generalized linear combination boundary conditions that work between different fields/variables and support non-homogeneous periodic conditions - non-constant essential boundary conditions given by a function in IGA - reorganized and improved documentation For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1 (rather long and technical). Best regards, Robert Cimrman and Contributors (*) (*) Contributors to this release (alphabetical order): Vladimir Lukes, Matyas Novak, Zhihua Ouyang, Jaroslav Vondrejc From warren.weckesser at gmail.com Thu Sep 25 19:56:02 2014 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Thu, 25 Sep 2014 19:56:02 -0400 Subject: [Numpy-discussion] Online docs for numpy are for version 1.8 Message-ID: Pinging the webmeisters: numpy 1.9 is released, but the docs at http://docs.scipy.org/doc/numpy/ are still for version 1.8. Warren From damian.avila at continuum.io Thu Sep 25 23:34:36 2014 From: damian.avila at continuum.io (Damian Avila) Date: Fri, 26 Sep 2014 00:34:36 -0300 Subject: [Numpy-discussion] ANN: Bokeh 0.6.1 release Message-ID: On behalf of the Bokeh team, I am very happy to announce the release of Bokeh version 0.6.1! Bokeh is a Python library for visualizing large and realtime datasets on the web. Its goal is to provide to developers (and domain experts) with capabilities to easily create novel and powerful visualizations that extract insight from local or remote (possibly large) data sets, and to easily publish those visualization to the web for others to explore and interact with. This point release includes several bug fixes and improvements over our most recent 0.6.0 release: * Toolbar enhancements * bokeh-server fixes * Improved documentation * Button widgets * Google map support in the Python side * Code cleanup in the JS side and examples * New examples See the CHANGELOG for full details. In upcoming releases, you should expect to see more new layout capabilities (colorbar axes, better grid plots and improved annotations), additional tools, even more widgets and more charts, R language bindings, Blaze integration and cloud hosting for Bokeh apps. Don't forget to check out the full documentation, interactive gallery, and tutorial at http://bokeh.pydata.org as well as the Bokeh IPython notebook nbviewer index (including all the tutorials) at: http://nbviewer.ipython.org/github/ContinuumIO/bokeh-notebooks/blob/master/index.ipynb If you are using Anaconda or miniconda, you can install with conda: conda install bokeh Alternatively, you can install with pip: pip install bokeh BokehJS is also available by CDN for use in standalone javascript applications: http://cdn.pydata.org/bokeh-0.6.1.min.js http://cdn.pydata.org/bokeh-0.6.1.min.css Issues, enhancement requests, and pull requests can be made on the Bokeh Github page: https://github.com/continuumio/bokeh Questions can be directed to the Bokeh mailing list: bokeh at continuum.io If you have interest in helping to develop Bokeh, please get involved! Cheers, Dami?n Avila damian.avila at continuum.io -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From lanceboyle at qwest.net Fri Sep 26 04:41:55 2014
From: lanceboyle at qwest.net (Jerry)
Date: Fri, 26 Sep 2014 01:41:55 -0700
Subject: [Numpy-discussion] Hamming etc. windows are wrong
Message-ID: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>

I've noticed that the Hamming window (and I suppose other windows) provided by Octave, SciPy****, and NumPy***** is wrong, returning a window that is symmetric rather than one that is "DFT symmetric." This is easily spotted by looking at the first and last points of the window; if they are the same values, then the window is incorrect.

These windows are normally applied to data that are equally spaced, thus implying that their DFTs are periodic. Let's assume that we are windowing N data points, indexed from 0 to N-1 as is normal in signal processing. Then the DFT is also N points long, also indexed from 0 to N-1. The periodic extension principle implies that the point indexed by N in the DFT domain has the same value as the point indexed by 0. The window can also be imagined to have the same kind of periodic extension, so that what would be the N-th point of the window matches the N-th point of the data with the same weight as was applied to the 0-th point of the data, but without duplicating that point at N-1.

This mistake is widespread, infecting many software packages. (I am sending this same note to the NumPy, SciPy, and Octave lists.) It is certainly wrong in most textbooks, including the classic, Oppenheim & Schafer, 1975. However, the definitive and widely referenced paper on windows by Fredric J. Harris, "On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform," Proceedings of the IEEE, vol. 66, no. 1, January 1978,* discusses this problem, both its roots and its correction. For example, quoting from the paper:

"Since the DFT essentially considers sequences to be periodic, we can consider the missing end point to be the beginning of the next period of the periodic extension of this sequence. In fact, under the periodic extension, the next sample (at 16 s in Fig. 1.) is indistinguishable from the sample at zero seconds.

"This apparent lack of symmetry due to the missing (but implied) end point is a source of confusion in sampled window design. This can be traced to the early work related to convergence factors for the partial sums of the Fourier series. The partial sums (or the finite Fourier transform) always include an odd number of points and exhibit even symmetry about the origin. Hence much of the literature and many software libraries incorporate windows designed with true even symmetry rather than the implied symmetry with the missing end point!"

...and later...

"We will now catalog some well-known (and some not well-known) windows. For each window we will comment on the justification for its use and identify its significant parameters. All the windows will be presented as even (about the origin) sequences with an odd number of points. To convert the window to DFT even, the right endpoint will be discarded and the sequence will be shifted so that the left end point coincides with the origin."

Note that he limits himself therefore to even-numbered "DFT even" windows, but the rule remains--honor the implied periodicity even for windows of odd length. Lest one consider this a trivial difference, consider the following Octave-Matlab code (corresponding NumPy and SciPy code would yield the same results).

x = ones(8, 1); # An even sequence; its DFT should be real.
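# (hWrong below is the usual symmetric 8-point window; hRight keeps the
# first 8 points of a 9-point symmetric window, dropping the duplicated
# endpoint so as to honor the implied periodic extension argued above.)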
hWrong = hamming(8)
h9 = hamming(9);
hRight = h9(1:8)
xWrong = x .* hWrong;    # Multiplication by "ones"
xRight = x .* hRight;    # shown for clarity.
XWrong = fft(xWrong)     # Contains nonzero imaginary part.
XRight = fft(xRight)     # Real, as expected from evenness of x.
XWrongMag = abs(XWrong)  # The magnitudes are
XRightMag = abs(XRight)  # also different.

This causes the following output:

hWrong =
   0.080000
   0.253195
   0.642360
   0.954446
   0.954446
   0.642360
   0.253195
   0.080000

hRight =
   0.080000
   0.214731
   0.540000
   0.865269
   1.000000
   0.865269
   0.540000
   0.214731

XWrong =
   3.86000 + 0.00000i
  -1.76795 - 0.73231i
   0.13889 + 0.13889i
   0.01906 + 0.04602i
   0.00000 + 0.00000i
   0.01906 - 0.04602i
   0.13889 - 0.13889i
  -1.76795 + 0.73231i

XRight =
   4.32000 + 0.00000i
  -1.84000 + 0.00000i
   0.00000 + 0.00000i
   0.00000 + 0.00000i
   0.00000 + 0.00000i
   0.00000 - 0.00000i
   0.00000 - 0.00000i
  -1.84000 - 0.00000i

XWrongMag =
   3.86000
   1.91362
   0.19642
   0.04981
   0.00000
   0.04981
   0.19642
   1.91362

XRightMag =
   4.32000
   1.84000
   0.00000
   0.00000
   0.00000
   0.00000
   0.00000
   1.84000

This news will likely upset many and I will be interested to see what arguments will flow in favor of keeping the status quo--I'm sure one will be, "we've always done it this way"--and others will be more substantial, possibly expressing decent arguments why the current situation is useful for some applications. In any case, I refer again to the Harris paper. So rather than host an argument on this list, this is what I propose: Do what Matlab** does. Acknowledge both uses by making a flag to handle the "periodic" or the "symmetric" cases. The Matlab default is "symmetric", which is of course unfortunate, but at least such inclusion in Octave, NumPy, and SciPy would retain compatibility with the existing usage. Then it's up to the user whether to shoot him/herself in the foot, assuming that such a decision is guided by actually referring to the documentation for the package being used and not blindly using the default.

Jerry Bauck

* http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1455106&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1455106
** http://www.mathworks.com/help/signal/ref/hamming.html
*** http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.signal.hamming.html
**** I am just learning Python-NumPy-SciPy but it appears as though the SciPy situation is that the documentation page above *** mentions the flag but the flag has not been implemented into SciPy itself. I would be glad to stand corrected.
***** http://docs.scipy.org/doc/numpy/reference/generated/numpy.hamming.html

From davidmenhur at gmail.com Fri Sep 26 05:56:32 2014
From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=)
Date: Fri, 26 Sep 2014 11:56:32 +0200
Subject: [Numpy-discussion] [SciPy-Dev] Hamming etc. windows are wrong
In-Reply-To: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
References: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
Message-ID:

On 26 September 2014 10:41, Jerry wrote:
> **** I am just learning Python-NumPy-SciPy but it appears as though the
> SciPy situation is that the documentation page above *** mentions the flag
> but the flag has not been implemented into SciPy itself. I would be glad to
> stand corrected.

Regarding this, you can look at the source:

    odd = M % 2
    if not sym and not odd:
        w = w[:-1]

so it is implemented in Scipy, explicitly limited to even windows. It is not in Numpy, though (but that can be fixed).

I think a reference is worth adding to the documentation, to make the intent and application of the sym flag clear.
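For example, a quick check of the flag's effect (the endpoint behavior, not the exact values, is the point here):

    from scipy import signal

    print(signal.hamming(8))             # symmetric: first and last samples equal
    print(signal.hamming(8, sym=False))  # periodic "DFT even": computed from a
                                         # 9-point window with the duplicated
                                         # endpoint dropped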
I am sure a PR will be most appreciated.

/David.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lanceboyle at qwest.net Fri Sep 26 06:39:56 2014
From: lanceboyle at qwest.net (Jerry)
Date: Fri, 26 Sep 2014 03:39:56 -0700
Subject: [Numpy-discussion] [SciPy-Dev] Hamming etc. windows are wrong
In-Reply-To: 
References: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
Message-ID: <47D57189-401F-4C37-B646-8565BFF7958A@qwest.net>

On Sep 26, 2014, at 2:56 AM, Daπid wrote:

> On 26 September 2014 10:41, Jerry wrote:
> **** I am just learning Python-NumPy-SciPy but it appears as though the SciPy situation is that the documentation page above *** mentions the flag but the flag has not been implemented into SciPy itself. I would be glad to stand corrected.

Hmm.... Earlier I was getting a Python error when I tried to supply the second argument, sym. That was the reason for my comment but now it accepts the second argument without complaint.

> Regarding this, you can look at the source:
>
>     odd = M % 2
>     if not sym and not odd:
>         w = w[:-1]
>
> so it is implemented in Scipy, explicitly limited to even windows.

I can't see any reason this should be limited to even-length windows. Like I said in my original note, the principle applies regardless of even- or odd-length windows.

Jerry

> It is not in Numpy, though (but that can be fixed).
>
> I think a reference is worth adding to the documentation, to make the intent and application of the sym flag clear. I am sure a PR will be most appreciated.
>
> /David.
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tom.augspurger88 at gmail.com Fri Sep 26 10:22:00 2014
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 26 Sep 2014 07:22:00 -0700 (PDT)
Subject: [Numpy-discussion] ANN: Bokeh 0.6.1 release
In-Reply-To: 
References: 
Message-ID: <70e53b8a-ec10-4b49-a54f-6de2e10e7053@googlegroups.com>

Congrats on the release. I've been playing with Bokeh the last couple weeks and it's really good.

There was some talk about expanding the ecosystem page in the pandas docs to show off brief examples for the various projects (btw I think we should rename that to **pydata** ecosystem, instead of pandas ecosystem).

Would you say the API for something like a ColumnDataSource example (like the car example) is stable enough to include in our docs?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bryanv at continuum.io Fri Sep 26 16:18:45 2014
From: bryanv at continuum.io (Bryan Van de Ven)
Date: Fri, 26 Sep 2014 15:18:45 -0500
Subject: [Numpy-discussion] ANN: Bokeh 0.6.1 release
In-Reply-To: <70e53b8a-ec10-4b49-a54f-6de2e10e7053@googlegroups.com>
References: <70e53b8a-ec10-4b49-a54f-6de2e10e7053@googlegroups.com>
Message-ID: 

Hi Tom,

I would wait for 0.7, where we are going to make a few changes to the plotting.py interface. That API started out as a "stateful, implicit current plot" style interface, and experience has shown that this does not always work well, especially in the IPython notebook where you can have any order of execution of the cells. It is basically not possible to reason about which plot is the "current plot" and we got lots of support issues around this. The interface is not changing drastically, but it is changing some.
Basically instead of "line(...)" acting on some implicit plot, you will have "p.line(...)" acting on some explicitly specified plot "p". Currently 0.7 looks to be released late October/early November, all the examples will be updated at that time, and I would not expect that API to change any time soon after.

Bryan

On Sep 26, 2014, at 9:22 AM, Tom Augspurger wrote:

> Congrats on the release. I've been playing with Bokeh the last couple weeks and it's really good.
>
> There was some talk about expanding the ecosystem page in the pandas docs to show off brief examples for the various projects (btw I think we should rename that to **pydata** ecosystem, instead of pandas ecosystem).
>
> Would you say the API for something like a ColumnDataSource example (like the car example) is stable enough to include in our docs?
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

From pmhobson at gmail.com Fri Sep 26 17:03:46 2014
From: pmhobson at gmail.com (Paul Hobson)
Date: Fri, 26 Sep 2014 14:03:46 -0700
Subject: [Numpy-discussion] [SciPy-Dev] Hamming etc. windows are wrong
In-Reply-To: <47D57189-401F-4C37-B646-8565BFF7958A@qwest.net>
References: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
 <47D57189-401F-4C37-B646-8565BFF7958A@qwest.net>
Message-ID: 

On Fri, Sep 26, 2014 at 3:39 AM, Jerry wrote:
>
> On Sep 26, 2014, at 2:56 AM, Daπid wrote:
>
> On 26 September 2014 10:41, Jerry wrote:
>
>> **** I am just learning Python-NumPy-SciPy but it appears as though the
>> SciPy situation is that the documentation page above *** mentions the flag
>> but the flag has not been implemented into SciPy itself. I would be glad to
>> stand corrected.
>
> Hmm.... Earlier I was getting a Python error when I tried to supply the
> second argument, sym. That was the reason for my comment but now it accepts
> the second argument without complaint.
>
Just curious: had you imported pylab at any point? Your sample code looks like you did several "from xxx import *" and it might have clobbered the scipy version (or vice versa).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lanceboyle at qwest.net Fri Sep 26 20:10:10 2014
From: lanceboyle at qwest.net (Jerry)
Date: Fri, 26 Sep 2014 17:10:10 -0700
Subject: [Numpy-discussion] [SciPy-Dev] Hamming etc. windows are wrong
In-Reply-To: 
References: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
 <47D57189-401F-4C37-B646-8565BFF7958A@qwest.net>
Message-ID: <6D3A944A-E4F3-4B79-8E7A-443758AE1358@qwest.net>

On Sep 26, 2014, at 2:03 PM, Paul Hobson wrote:

> On Fri, Sep 26, 2014 at 3:39 AM, Jerry wrote:
>
> On Sep 26, 2014, at 2:56 AM, Daπid wrote:
>
>> On 26 September 2014 10:41, Jerry wrote:
>> **** I am just learning Python-NumPy-SciPy but it appears as though the SciPy situation is that the documentation page above *** mentions the flag but the flag has not been implemented into SciPy itself. I would be glad to stand corrected.
>
> Hmm.... Earlier I was getting a Python error when I tried to supply the second argument, sym. That was the reason for my comment but now it accepts the second argument without complaint.
>
> Just curious: had you imported pylab at any point? Your sample code looks like you did several "from xxx import *" and it might have clobbered the scipy version (or vice versa).

I don't recall what I had imported, but probably scipy, as in "import scipy", or maybe "import scipy as sc."
I do believe that I was working interactively so I don't have any actual evidence at this point.

But I just had this thought--maybe I was in numpy and not scipy. For some reason both numpy and scipy provide a hamming function (why?) so maybe I was using the numpy version by accident which IIRC does not have the second argument.

My sample code is in Octave. 8^)

Jerry

> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From robert.kern at gmail.com Sat Sep 27 06:37:12 2014
From: robert.kern at gmail.com (Robert Kern)
Date: Sat, 27 Sep 2014 11:37:12 +0100
Subject: [Numpy-discussion] [SciPy-Dev] Hamming etc. windows are wrong
In-Reply-To: <6D3A944A-E4F3-4B79-8E7A-443758AE1358@qwest.net>
References: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
 <47D57189-401F-4C37-B646-8565BFF7958A@qwest.net>
 <6D3A944A-E4F3-4B79-8E7A-443758AE1358@qwest.net>
Message-ID: 

On Sat, Sep 27, 2014 at 1:10 AM, Jerry wrote:
> I don't recall what I had imported, but probably scipy, as in "import
> scipy", or maybe "import scipy as sc." I do believe that I was working
> interactively so I don't have any actual evidence at this point.
>
> But I just had this thought--maybe I was in numpy and not scipy. For some
> reason both numpy and scipy provide a hamming function (why?) so maybe I was
> using the numpy version by accident which IIRC does not have the second
> argument.
>
> My sample code is in Octave. 8^)

scipy's hamming() function is scipy.signal.hamming(). It's in the subpackage, and you need to import it like so:

  from scipy import signal
  signal.hamming(...)

There *is* a scipy.hamming() function as well, which is just an alias for numpy.hamming().

  [~]
  |1> import scipy

  [~]
  |2> import numpy

  [~]
  |3> scipy.hamming is numpy.hamming
  True

Explaining why is a long story, but let's just say "historical reasons" and leave it at that. You almost never want to just "import scipy". All of the scipy goodness is in the subpackages.

--
Robert Kern

From darcamo at gmail.com Sat Sep 27 12:37:59 2014
From: darcamo at gmail.com (Darlan Cavalcante Moreira)
Date: Sat, 27 Sep 2014 13:37:59 -0300
Subject: [Numpy-discussion] Strange bug in SVD when numpy is installed with pip
Message-ID: <87vbo96mu0.fsf@gmail.com>

Some time ago I have reported a bug about the linalg.matrix_rank in numpy for complex matrices. This was quickly fixed, and to take advantage of the fix I'm now using numpy 1.9.0 installed through pip, instead of the version from my system (Ubuntu 14.04, with numpy version 1.8.1).

However, I have now encountered a very strange bug in the SVD function, but only when numpy is manually installed (with pip in my case). When I calculate the SVD of complex matrices with more columns than rows, the last rows of the returned V_H matrix are all equal to zeros. This does not happen for all shapes, but for the ones where this happens it will always happen.

This can be reproduced with the code below. You can change the sizes of M and N and it happens for other sizes where N > M, but not all of them.
--8<---------------cut here---------------start------------->8---
import numpy as np

M = 8   # Number of rows
N = 12  # Number of columns

# Calculate the SVD of a non-square complex random matrix
[U, S, V_H] = np.linalg.svd(np.random.randn(M, N) + 1j*np.random.randn(M, N),
                            full_matrices=True)

# Calculate the norm of the submatrix formed by the last N-M rows
if np.linalg.norm(V_H[M-N:]) < 1e-30:
    print("Bug!")
else:
    print("No Bug")
# See the N-M rows. They are all equal to zeros
print(V_H[M-N:])
--8<---------------cut here---------------end--------------->8---

The original matrix can still be obtained from the decomposition, since the zero rows correspond to zero singular values due to the fact that the original matrix has more columns than rows. However, since the user asked for 'full_matrices' here (the default), returning all zeroes for these extra rows is not useful.

In order to isolate the bug I tried installing some different numpy versions in different virtualenvs. I tried versions 1.6, 1.7, 1.8.1 and 1.9 and the bug appears in all of them. Since it does not happen if I use the version 1.8.1 installed through the Ubuntu package manager, I imagine it is due to some issue when pip compiles numpy locally.

Note: If I run numpy.testing.test() all tests are OK for all numpy versions I have tested. The only fail is a known fail and I get "OK (KNOWNFAIL=1)".

--
Darlan Cavalcante Moreira

From charlesr.harris at gmail.com Sat Sep 27 13:33:18 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Sat, 27 Sep 2014 11:33:18 -0600
Subject: [Numpy-discussion] Strange bug in SVD when numpy is installed with pip
In-Reply-To: <87vbo96mu0.fsf@gmail.com>
References: <87vbo96mu0.fsf@gmail.com>
Message-ID: 

On Sat, Sep 27, 2014 at 10:37 AM, Darlan Cavalcante Moreira < darcamo at gmail.com> wrote:

> Some time ago I have reported a bug about the linalg.matrix_rank in
> numpy for complex matrices. This was quickly fixed, and to take
> advantage of the fix I'm now using numpy 1.9.0 installed through pip,
> instead of the version from my system (Ubuntu 14.04, with numpy version
> 1.8.1).
>
> However, I have now encountered a very strange bug in the SVD function,
> but only when numpy is manually installed (with pip in my case). When I
> calculate the SVD of complex matrices with more columns than rows, the
> last rows of the returned V_H matrix are all equal to zeros. This does
> not happen for all shapes, but for the ones where this happens it will
> always happen.
>
> This can be reproduced with the code below. You can change the sizes of
> M and N and it happens for other sizes where N > M, but not all of them.
>
> --8<---------------cut here---------------start------------->8---
> import numpy as np
>
> M = 8   # Number of rows
> N = 12  # Number of columns
>
> # Calculate the SVD of a non-square complex random matrix
> [U, S, V_H] = np.linalg.svd(np.random.randn(M, N) + 1j*np.random.randn(M, N),
>                             full_matrices=True)
>
> # Calculate the norm of the submatrix formed by the last N-M rows
> if np.linalg.norm(V_H[M-N:]) < 1e-30:
>     print("Bug!")
> else:
>     print("No Bug")
> # See the N-M rows. They are all equal to zeros
> print(V_H[M-N:])
> --8<---------------cut here---------------end--------------->8---
>
> The original matrix can still be obtained from the decomposition, since
> the zero rows correspond to zero singular values due to the fact that
> the original matrix has more columns than rows. However, since the user
> asked for 'full_matrices' here (the default), returning all zeroes for
> these extra rows is not useful.
>
> In order to isolate the bug I tried installing some different numpy
> versions in different virtualenvs. I tried versions 1.6, 1.7, 1.8.1 and
> 1.9 and the bug appears in all of them. Since it does not happen if I
> use the version 1.8.1 installed through the Ubuntu package manager, I
> imagine it is due to some issue when pip compiles numpy locally.
>
> Note: If I run numpy.testing.test() all tests are OK for all numpy
> versions I have tested. The only fail is a known fail and I get
> "OK (KNOWNFAIL=1)".
>

What does `np.__config__.show()` show?

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From darcamo at gmail.com Sat Sep 27 15:46:10 2014
From: darcamo at gmail.com (Darlan Cavalcante Moreira)
Date: Sat, 27 Sep 2014 16:46:10 -0300
Subject: [Numpy-discussion] Strange bug in SVD when numpy is installed with pip
In-Reply-To: 
References: <87vbo96mu0.fsf@gmail.com>
Message-ID: <87tx3s7sot.fsf@gmail.com>

>>> np.__config__.show()
lapack_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

With so many "NOT AVAILABLE" I'm now surprised SVD is even defined.

Charles R Harris writes:
> On Sat, Sep 27, 2014 at 10:37 AM, Darlan Cavalcante Moreira <
> darcamo at gmail.com> wrote:
>
>> Some time ago I have reported a bug about the linalg.matrix_rank in
>> numpy for complex matrices. This was quickly fixed, and to take
>> advantage of the fix I'm now using numpy 1.9.0 installed through pip,
>> instead of the version from my system (Ubuntu 14.04, with numpy version
>> 1.8.1).
>>
>> However, I have now encountered a very strange bug in the SVD function,
>> but only when numpy is manually installed (with pip in my case). When I
>> calculate the SVD of complex matrices with more columns than rows, the
>> last rows of the returned V_H matrix are all equal to zeros. This does
>> not happen for all shapes, but for the ones where this happens it will
>> always happen.
>>
>> This can be reproduced with the code below. You can change the sizes of
>> M and N and it happens for other sizes where N > M, but not all of them.
>>
>> --8<---------------cut here---------------start------------->8---
>> import numpy as np
>>
>> M = 8   # Number of rows
>> N = 12  # Number of columns
>>
>> # Calculate the SVD of a non-square complex random matrix
>> [U, S, V_H] = np.linalg.svd(np.random.randn(M, N) + 1j*np.random.randn(M, N),
>>                             full_matrices=True)
>>
>> # Calculate the norm of the submatrix formed by the last N-M rows
>> if np.linalg.norm(V_H[M-N:]) < 1e-30:
>>     print("Bug!")
>> else:
>>     print("No Bug")
>> # See the N-M rows. They are all equal to zeros
>> print(V_H[M-N:])
>> --8<---------------cut here---------------end--------------->8---
>>
>> The original matrix can still be obtained from the decomposition, since
>> the zero rows correspond to zero singular values due to the fact that
>> the original matrix has more columns than rows. However, since the user
>> asked for 'full_matrices' here (the default), returning all zeroes for
>> these extra rows is not useful.
>>
>> In order to isolate the bug I tried installing some different numpy
>> versions in different virtualenvs. I tried versions 1.6, 1.7, 1.8.1 and
>> 1.9 and the bug appears in all of them. Since it does not happen if I
>> use the version 1.8.1 installed through the Ubuntu package manager, I
>> imagine it is due to some issue when pip compiles numpy locally.
>>
>> Note: If I run numpy.testing.test() all tests are OK for all numpy
>> versions I have tested. The only fail is a known fail and I get
>> "OK (KNOWNFAIL=1)".
>>
> What does `np.__config__.show()` show?
>
> Chuck
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

--
Sent with my mu4e

From charlesr.harris at gmail.com Sat Sep 27 16:37:10 2014
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Sat, 27 Sep 2014 14:37:10 -0600
Subject: [Numpy-discussion] Strange bug in SVD when numpy is installed with pip
In-Reply-To: <87tx3s7sot.fsf@gmail.com>
References: <87vbo96mu0.fsf@gmail.com> <87tx3s7sot.fsf@gmail.com>
Message-ID: 

On Sat, Sep 27, 2014 at 1:46 PM, Darlan Cavalcante Moreira < darcamo at gmail.com> wrote:

> >>> np.__config__.show()
> lapack_info:
>   NOT AVAILABLE
> lapack_opt_info:
>   NOT AVAILABLE
> blas_info:
>   NOT AVAILABLE
> atlas_threads_info:
>   NOT AVAILABLE
> blas_src_info:
>   NOT AVAILABLE
> atlas_blas_info:
>   NOT AVAILABLE
> lapack_src_info:
>   NOT AVAILABLE
> openblas_info:
>   NOT AVAILABLE
> atlas_blas_threads_info:
>   NOT AVAILABLE
> blas_mkl_info:
>   NOT AVAILABLE
> blas_opt_info:
>   NOT AVAILABLE
> atlas_info:
>   NOT AVAILABLE
> lapack_mkl_info:
>   NOT AVAILABLE
> mkl_info:
>   NOT AVAILABLE
>
> With so many "NOT AVAILABLE" I'm now surprised SVD is even defined.
>

Looks like it is falling back to numpy's internal versions. You might try installing ATLAS to see if that makes a difference.

BTW, the numpy custom is bottom posting.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From darcamo at gmail.com Sat Sep 27 18:10:26 2014
From: darcamo at gmail.com (darcamo at gmail.com)
Date: Sat, 27 Sep 2014 19:10:26 -0300
Subject: [Numpy-discussion] Strange bug in SVD when numpy is installed with pip
In-Reply-To: 
References: <87vbo96mu0.fsf@gmail.com> <87tx3s7sot.fsf@gmail.com>
Message-ID: 

Thanks Charles! I installed atlas and lapack and then reinstalled numpy with pip. It works correctly now.

An easy fix for this for now is raising an exception when numpy is using its internal version for SVD and the user asked for the full matrices. The exception message can instruct the user to either install lapack and recompile numpy, or set full_matrices to False (if that is enough and the user can't reinstall numpy with lapack for some reason). This would avoid the bug hunting I had to do in my code.
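Something along these lines could work in the meantime as a user-side guard (just a sketch; the `np.__config__.get_info` lookup and the 'lapack_opt' key are assumptions about what numpy exposes, and a real fix would live next to the lapack_lite fallback itself):

    import numpy as np

    def svd_full(a):
        # Refuse full_matrices=True when numpy appears to have been built
        # against its bundled lapack_lite rather than an external LAPACK,
        # since that is the configuration that produced the zero rows above.
        if not np.__config__.get_info('lapack_opt'):
            raise RuntimeError(
                "numpy was built without an optimized LAPACK; rebuild "
                "against LAPACK/ATLAS or call svd with full_matrices=False")
        return np.linalg.svd(a, full_matrices=True)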
On 27/09/2014 17:37, "Charles R Harris" wrote:
>
> On Sat, Sep 27, 2014 at 1:46 PM, Darlan Cavalcante Moreira <
> darcamo at gmail.com> wrote:
>
>> >>> np.__config__.show()
>> lapack_info:
>>   NOT AVAILABLE
>> lapack_opt_info:
>>   NOT AVAILABLE
>> blas_info:
>>   NOT AVAILABLE
>> atlas_threads_info:
>>   NOT AVAILABLE
>> blas_src_info:
>>   NOT AVAILABLE
>> atlas_blas_info:
>>   NOT AVAILABLE
>> lapack_src_info:
>>   NOT AVAILABLE
>> openblas_info:
>>   NOT AVAILABLE
>> atlas_blas_threads_info:
>>   NOT AVAILABLE
>> blas_mkl_info:
>>   NOT AVAILABLE
>> blas_opt_info:
>>   NOT AVAILABLE
>> atlas_info:
>>   NOT AVAILABLE
>> lapack_mkl_info:
>>   NOT AVAILABLE
>> mkl_info:
>>   NOT AVAILABLE
>>
>> With so many "NOT AVAILABLE" I'm now surprised SVD is even defined.
>>
> Looks like it is falling back to numpy's internal versions. You might try
> installing ATLAS to see if that makes a difference.
>
> BTW, the numpy custom is bottom posting.
>
> Chuck
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From daverz at gmail.com Sun Sep 28 15:05:59 2014
From: daverz at gmail.com (Dave Cook)
Date: Sun, 28 Sep 2014 12:05:59 -0700
Subject: [Numpy-discussion] Hamming etc. windows are wrong
In-Reply-To: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
References: <32A9B832-5B17-4C3D-B573-336E4700FA36@qwest.net>
Message-ID: 

On Fri, Sep 26, 2014 at 1:41 AM, Jerry wrote:

> I've noticed that the Hamming window (and I suppose other windows)
> provided by Octave, SciPy****, and NumPy***** is wrong, returning a window
> that is symmetric rather than one that is "DFT symmetric." This is easily
> spotted by looking at the first and last points of the window; if they are
> the same values, then the window is incorrect.
>
If you use signal.get_window(), the default is sym=False:

    def get_window(window, Nx, fftbins=True):
        # snip
        sym = not fftbins

Dave Cook
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jzwinck at gmail.com Tue Sep 30 06:49:47 2014
From: jzwinck at gmail.com (John Zwinck)
Date: Tue, 30 Sep 2014 18:49:47 +0800
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
Message-ID: 

I first proposed this on GitHub: https://github.com/numpy/numpy/issues/5134 ; jaimefrio requested that I bring it to this list for discussion.

My proposal is to add a keys() method to NumPy's array class ndarray. The behavior would be to return self.dtype.names, i.e. the "column names" for a structured array (and None when dtype.names is None, which it is for pure numeric arrays without named columns).

I originally proposed to add a values() method also, but I am tabling that for now so we needn't discuss it in this thread.

The motivation is to enhance the ability to use duck typing with NumPy arrays, Python dicts, and other types like Pandas DataFrames, h5py Files, and more. It's a fairly common thing to want to get the "keys" of a container, where "keys" is understood to be a sequence of values one can pass to __getitem__(), and this is exactly what I'm aiming at.

Thoughts?

John Zwinck

From hoogendoorn.eelco at gmail.com Tue Sep 30 07:29:08 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Tue, 30 Sep 2014 13:29:08 +0200
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
In-Reply-To: 
References: 
Message-ID: 

Sounds fair to me.
Indeed the ducktyping argument makes sense, and I have a hard time imagining any namespace conflicts or other confusion. Should this attribute return None for non-structured arrays, or simply be undefined?

On Tue, Sep 30, 2014 at 12:49 PM, John Zwinck wrote:

> I first proposed this on GitHub:
> https://github.com/numpy/numpy/issues/5134 ; jaimefrio requested that
> I bring it to this list for discussion.
>
> My proposal is to add a keys() method to NumPy's array class ndarray.
> The behavior would be to return self.dtype.names, i.e. the "column
> names" for a structured array (and None when dtype.names is None,
> which it is for pure numeric arrays without named columns).
>
> I originally proposed to add a values() method also, but I am tabling
> that for now so we needn't discuss it in this thread.
>
> The motivation is to enhance the ability to use duck typing with NumPy
> arrays, Python dicts, and other types like Pandas DataFrames, h5py
> Files, and more. It's a fairly common thing to want to get the "keys"
> of a container, where "keys" is understood to be a sequence of values
> one can pass to __getitem__(), and this is exactly what I'm aiming at.
>
> Thoughts?
>
> John Zwinck
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ferdinand.bayard at strains.fr Tue Sep 30 09:15:10 2014
From: ferdinand.bayard at strains.fr (Bayard)
Date: Tue, 30 Sep 2014 15:15:10 +0200
Subject: [Numpy-discussion] f2py and debug mode
Message-ID: <542AACDE.7010408@strains.fr>

Hello to all. I'm aiming to wrap a Fortran program into Python. I started to work with f2py, and am trying to set up a debug mode where I could reach breakpoints in the Fortran module launched by Python. I've been looking at the existing posts, but have not seen anything like that.

I'm used to working with Visual Studio 2012 and the Intel Fortran compiler, and I have tried to get to that point by doing:

1) Run f2py -m to get the *.c wrapper
2) Embed it in a C project in Visual Studio, together with fortranobject.c and fortranobject.h,
3) Create a solution which also contains my Fortran files compiled as a lib,
4) Generate in debug mode a "dll" with extension pyd (to get to that point, the "main" function in Fortran is named "_main").

I compiled without any error, and I reach breakpoints in the C wrapper, but not in Fortran, and the Fortran code seems not to be executed (whereas it is when compiling with f2py -c). Trying to understand the f2py code, I noticed that f2py is not only writing the C wrapper, but also compiling it in a specific way.

Is there a way to get a debug mode in Visual Studio with f2py (some members of the team are used to it)? Any alternative tool we should use for debugging?

Thanks for answering,
Ferdinand

---
This email contains no viruses or malware because avast! Antivirus protection is active.
http://www.avast.com

From ben.root at ou.edu Tue Sep 30 09:19:49 2014
From: ben.root at ou.edu (Benjamin Root)
Date: Tue, 30 Sep 2014 09:19:49 -0400
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
In-Reply-To: 
References: 
Message-ID: 

I am also +1. I have already used structured arrays to do keyword-based string formatting. This makes sense as well. Would this enable keyword argument expansion?
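(A quick sketch of what I have in mind -- the wrapper class here is hypothetical, standing in for the proposed method -- since f(**obj) only looks for keys() plus __getitem__:)

    import numpy as np

    # Hypothetical stand-in for the proposed ndarray.keys(): any object
    # exposing keys() and __getitem__ supports ** keyword expansion.
    class KeyedRow(object):
        def __init__(self, row):
            self._row = row  # one record of a structured array
        def keys(self):
            return self._row.dtype.names
        def __getitem__(self, name):
            return self._row[name]

    a = np.array([(1, 2.5)], dtype=[('x', 'i4'), ('y', 'f8')])
    print("{x} and {y}".format(**KeyedRow(a[0])))  # prints "1 and 2.5"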
On Tue, Sep 30, 2014 at 7:29 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> Sounds fair to me. Indeed the duck-typing argument makes sense, and I have
> a hard time imagining any namespace conflicts or other confusion. Should
> this attribute return None for non-structured arrays, or simply be
> undefined?
>
> On Tue, Sep 30, 2014 at 12:49 PM, John Zwinck wrote:
>
>> I first proposed this on GitHub:
>> https://github.com/numpy/numpy/issues/5134 ; jaimefrio requested that
>> I bring it to this list for discussion.
>>
>> My proposal is to add a keys() method to NumPy's array class ndarray.
>> The behavior would be to return self.dtype.names, i.e. the "column
>> names" for a structured array (and None when dtype.names is None,
>> which it is for pure numeric arrays without named columns).
>>
>> I originally proposed to add a values() method also, but I am tabling
>> that for now so we needn't discuss it in this thread.
>>
>> The motivation is to enhance the ability to use duck typing with NumPy
>> arrays, Python dicts, and other types like Pandas DataFrames, h5py
>> Files, and more. It's a fairly common thing to want to get the "keys"
>> of a container, where "keys" is understood to be a sequence of values
>> one can pass to __getitem__(), and this is exactly what I'm aiming at.
>>
>> Thoughts?
>>
>> John Zwinck
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From shoyer at gmail.com Tue Sep 30 14:05:21 2014
From: shoyer at gmail.com (Stephan Hoyer)
Date: Tue, 30 Sep 2014 11:05:21 -0700
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
In-Reply-To: 
References: 
Message-ID: 

I like this idea. But I am -1 on returning None if the array is
unstructured. I expect .keys(), if present, to always return an iterable.

In fact, this would break some of my existing code, which checks for the
existence of "keys" as a way to do duck typed checks for dictionary like
objects (e.g., including pandas.DataFrame):
https://github.com/xray/xray/blob/v0.3/xray/core/utils.py#L165
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hoogendoorn.eelco at gmail.com Tue Sep 30 16:21:09 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Tue, 30 Sep 2014 22:21:09 +0200
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
In-Reply-To: 
References: 
Message-ID: 

So a non-structured array should return an empty list/iterable as its
keys? That doesn't seem right to me, but perhaps you have a compelling
example to the contrary.

I mean, wouldn't we want the duck-typing to fail if it isn't a structured
array? Throwing an AttributeError seems like the best thing to do, from a
duck-typing perspective.

On Tue, Sep 30, 2014 at 8:05 PM, Stephan Hoyer wrote:

> I like this idea. But I am -1 on returning None if the array is
> unstructured. I expect .keys(), if present, to always return an iterable.
>
> In fact, this would break some of my existing code, which checks for the
> existence of "keys" as a way to do duck typed checks for dictionary like
> objects (e.g., including pandas.DataFrame):
> https://github.com/xray/xray/blob/v0.3/xray/core/utils.py#L165
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hoogendoorn.eelco at gmail.com Tue Sep 30 16:22:07 2014
From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn)
Date: Tue, 30 Sep 2014 22:22:07 +0200
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
In-Reply-To: 
References: 
Message-ID: 

On more careful reading of your words, I think we agree; indeed, if keys()
is present it should return an iterable; but I don't think it should be
present for non-structured arrays.

On Tue, Sep 30, 2014 at 10:21 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> So a non-structured array should return an empty list/iterable as its
> keys? That doesn't seem right to me, but perhaps you have a compelling
> example to the contrary.
>
> I mean, wouldn't we want the duck-typing to fail if it isn't a structured
> array? Throwing an AttributeError seems like the best thing to do, from a
> duck-typing perspective.
>
> On Tue, Sep 30, 2014 at 8:05 PM, Stephan Hoyer wrote:
>
>> I like this idea. But I am -1 on returning None if the array is
>> unstructured. I expect .keys(), if present, to always return an iterable.
>>
>> In fact, this would break some of my existing code, which checks for the
>> existence of "keys" as a way to do duck typed checks for dictionary like
>> objects (e.g., including pandas.DataFrame):
>> https://github.com/xray/xray/blob/v0.3/xray/core/utils.py#L165
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From shoyer at gmail.com Tue Sep 30 16:30:43 2014
From: shoyer at gmail.com (Stephan Hoyer)
Date: Tue, 30 Sep 2014 13:30:43 -0700
Subject: [Numpy-discussion] Proposal: add ndarray.keys() to return dtype.names
In-Reply-To: 
References: 
Message-ID: 

On Tue, Sep 30, 2014 at 1:22 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> On more careful reading of your words, I think we agree; indeed, if keys()
> is present it should return an iterable; but I don't think it should be
> present for non-structured arrays.
>

Indeed, I think we do agree. The attribute can simply be missing (e.g.,
accessing it raises AttributeError) for non-structured arrays.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: