From scott.sinclair.za at gmail.com Wed Jun 1 04:54:49 2011 From: scott.sinclair.za at gmail.com (Scott Sinclair) Date: Wed, 1 Jun 2011 10:54:49 +0200 Subject: [Numpy-discussion] Problem installing Numpy in virtualenv Message-ID: Hi, This is a response to http://mail.scipy.org/pipermail/numpy-discussion/2011-April/055908.html A cleaner workaround that doesn't mess with your system Python (see https://github.com/pypa/virtualenv/issues/118): activate the virtualenv, run mkdir $VIRTUAL_ENV/local; ln -s $VIRTUAL_ENV/lib $VIRTUAL_ENV/local/lib; then install Numpy. Cheers, Scott From bettenbu at uni-potsdam.de Wed Jun 1 05:44:08 2011 From: bettenbu at uni-potsdam.de (Mario Bettenbuehl) Date: Wed, 01 Jun 2011 11:44:08 +0200 Subject: [Numpy-discussion] Numeric integration of higher order integrals Message-ID: <4DE609E8.6030103@uni-potsdam.de> Hello everyone, I am currently tackling the issue of numerically solving an integral of higher dimension. I am comparing models, and their dimension increases with order 2^n. Taking a closer look at its projections along the axes, down to a two-dimensional picture, the projections are of Gaussian nature, thus they show a Gaussian bump. I have already used two approaches: 1. brute force: Process the values at discrete grid points and calculate the area of the obtained rectangle, cube, ... with a grid of 5x5x5x5 for a 4th order equation. 2. Gaussian quad: Cascading Gaussian quadrature given from numpy/scipy with a grid size of 100x100x... The problem I have: For 1: How reliable are the results and does anyone have experience with equations whose projections are Gaussian-like and solved these with the straightforward method? But how large should the grid be? For 2: How large do I need to choose the grid to still obtain reliable results? Is a grid of 10x10 sufficiently large? Thanks in advance for any reply. If needed, I'll directly provide further information about the problem. Greetings, Mario -------------- next part -------------- A non-text attachment was scrubbed... Name: bettenbu.vcf Type: text/x-vcard Size: 402 bytes Desc: not available URL: From aarchiba at physics.mcgill.ca Wed Jun 1 08:47:00 2011 From: aarchiba at physics.mcgill.ca (Anne Archibald) Date: Wed, 1 Jun 2011 08:47:00 -0400 Subject: [Numpy-discussion] Numeric integration of higher order integrals In-Reply-To: <4DE609E8.6030103@uni-potsdam.de> References: <4DE609E8.6030103@uni-potsdam.de> Message-ID: When the dimensionality gets high, grid methods like you're describing start to be a problem ("the curse of dimensionality"). The standard approaches are simple Monte Carlo integration or its refinements (Metropolis-Hastings, for example). These converge somewhat slowly, but are not much affected by the dimensionality. Anne On 1 June 2011 05:44, Mario Bettenbuehl wrote: > Hello everyone, > > I am currently tackling the issue of numerically solving an integral of higher > dimension. I am comparing models, > and their dimension increases with order 2^n. > Taking a closer look at its projections along the axes, down to a two-dimensional > picture, the projections are of Gaussian nature, thus > they show a Gaussian bump. > > I have already used two approaches: > 1. brute force: Process the values at discrete grid points and > calculate the area of the obtained rectangle, cube, ... with a grid of > 5x5x5x5 for a 4th order equation. > 2. Gaussian quad: Cascading Gaussian quadrature given from numpy/ > scipy with a grid size of 100x100x... > > The problem I have: > For 1: 
How reliable are the results and does anyone have experience with > equations whose projections are Gaussian-like and solved these with the > straightforward method? But how large should the grid be? > For 2: How large do I need to choose the grid to still obtain reliable > results? Is a grid of 10x10 sufficiently large? > > Thanks in advance for any reply. If needed, I'll directly provide further > information about the problem. > > Greetings, > Mario > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From kwgoodman at gmail.com Wed Jun 1 11:17:55 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Wed, 1 Jun 2011 08:17:55 -0700 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Tue, May 31, 2011 at 8:41 PM, Charles R Harris wrote: > On Tue, May 31, 2011 at 8:50 PM, Bruce Southey wrote: >> How about including all or some of Keith's Bottleneck package? >> He has tried to include some of the discussed functions and tried to >> make them very fast. > > I don't think they are sufficiently general as they are limited to 2 > dimensions. However, I think the moving filters should go into scipy, either > in ndimage or maybe signals. Some of the others we can still speed up > significantly, for instance nanmedian, by using the new functionality in > numpy, i.e., numpy sort has worked with nans for a while now. It looks like > call overhead dominates the nanmax times for small arrays and this might > improve if the ufunc machinery is cleaned up a bit more, I don't know how > far Mark got with that. Currently Bottleneck accelerates 1d, 2d, and 3d input. Anything else falls back to a slower, non-cython version of the function. The same goes for int32, int64, float32, float64. It should not be difficult to extend to higher nd and more dtypes since everything is generated from a template. The problem is that there would be a LOT of cython auto-generated C code since there is a separate function for each ndim, dtype, axis combination. Each of the ndim, dtype, axis functions currently has its own copy of the algorithm (such as median). Pulling that out and reusing it should save a lot of trees by reducing the auto-generated C code size. I recently added a partsort and argpartsort. From markperrymiller at gmail.com Wed Jun 1 11:31:55 2011 From: markperrymiller at gmail.com (Mark Miller) Date: Wed, 1 Jun 2011 08:31:55 -0700 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: I'd love to see something like a "count_unique" function included. The numpy.unique function is handy, but it can be a little awkward to efficiently go back and get counts of each unique value after the fact. -Mark On Wed, Jun 1, 2011 at 8:17 AM, Keith Goodman wrote: > On Tue, May 31, 2011 at 8:41 PM, Charles R Harris > wrote: >> On Tue, May 31, 2011 at 8:50 PM, Bruce Southey wrote: >> >>> How about including all or some of Keith's Bottleneck package? >>> He has tried to include some of the discussed functions and tried to >>> make them very fast. >> >> I don't think they are sufficiently general as they are limited to 2 >> dimensions. However, I think the moving filters should go into scipy, either >> in ndimage or maybe signals. Some of the others we can still speed up >> significantly, for instance nanmedian, by using the new functionality in >> numpy, i.e., numpy sort has worked with nans for a while now. 
It looks like >> call overhead dominates the nanmax times for small arrays and this might >> improve if the ufunc machinery is cleaned up a bit more, I don't know how >> far Mark got with that. > > Currently Bottleneck accelerates 1d, 2d, and 3d input. Anything else > falls back to a slower, non-cython version of the function. The same > goes for int32, int64, float32, float64. > > It should not be difficult to extend to higher nd and more dtypes > since everything is generated from template. The problem is that there > would be a LOT of cython auto-generated C code since there is a > separate function for each ndim, dtype, axis combination. > > Each of the ndim, dtype, axis functions currently has its own copy of > the algorithm (such as median). Pulling that out and reusing it should > save a lot of trees by reducing the auto-generated C code size. > > I recently added a partsort and argpartsort. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From craigyk at me.com Wed Jun 1 11:44:20 2011 From: craigyk at me.com (Craig Yoshioka) Date: Wed, 01 Jun 2011 08:44:20 -0700 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. That way np.mean(np.zeros([4000,4000],'f4')+500) would not equal 511.493408? ` On May 31, 2011, at 6:08 PM, Charles R Harris wrote: > Hi All, > > I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest. > > 1) Modified sort/argsort functions that return the maximum k values. > This is easy to do with heapsort and almost as easy with mergesort. > > 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible. > > 3) Fast medians. > > > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From gmane at blindgoat.org Wed Jun 1 11:45:15 2011 From: gmane at blindgoat.org (martin smith) Date: Wed, 01 Jun 2011 11:45:15 -0400 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows Message-ID: I have a bit of code that performs multi-taper power spectra using numpy and a C extension module. The C portion consists of an interface file and a python-unaware computational file. The latter invokes fftpack. The straightforward setup.py appended below works fine on Linux. On Windows using MinGW it's not so good. I first created a libpython27.a file following instructions on the web. The C code now compiles but fails to link since there is no fftpack_lite.so on Windows. I can't find an fftpack_lite.dll (or even the source for fftpack_lite) and don't know where to turn. Trying to sort these problems out in Windows is (to me) difficult and bewildering. I'd much appreciate any advice. - martin smith (BTW: the code is unencumbered and available for the asking.) 
---------------------------------------------------------------------- from numpy.distutils.core import setup from numpy.distutils.extension import Extension import os.path import distutils.sysconfig as sf use_fftpack = 1 if use_fftpack: pymutt_module = Extension( 'pymutt', ['pymutt.c'], library_dirs = [os.path.join(sf.get_python_lib(), 'numpy', 'fft')], define_macros = [('USE_FFTPACK', None)], libraries = [':fftpack_lite.so'], ) else: pymutt_module = Extension( 'pymutt', ['pymutt.c'], libraries = ['fftw3'], ) setup(name='pymutt', version='1.0', description = "numpy support for multi-taper fourier transforms", author = "Martin L. Smith", long_description = ''' Implements a numpy interface to an implementation of Thomson's multi-taper fourier transform algorithms. The key C module (mtbase.c) is derived from code written and made freely available by J. M. Lees and Jeff Park. ''', ext_modules=[pymutt_module], ) ---------------------------------------------------------------------------------------- From ndbecker2 at gmail.com Wed Jun 1 11:59:52 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Wed, 01 Jun 2011 11:59:52 -0400 Subject: [Numpy-discussion] New functions. References: Message-ID: Short-circuiting find would be nice. Right now, to 'find' something you first make a bool array, then iterate over it. If all you want is the first index where x[i] = e, not very efficient. What I just described is a find with a '==' predicate. Not sure if it's worthwhile to consider other predicates. Maybe call it 'find_first' Mark Miller wrote: > I'd love to see something like a "count_unique" function included. The > numpy.unique function is handy, but it can be a little awkward to > efficiently go back and get counts of each unique value after the > fact. > > -Mark > > > > On Wed, Jun 1, 2011 at 8:17 AM, Keith Goodman wrote: >> On Tue, May 31, 2011 at 8:41 PM, Charles R Harris >> wrote: >>> On Tue, May 31, 2011 at 8:50 PM, Bruce Southey wrote: >> >>>> How about including all or some of Keith's Bottleneck package? >>>> He has tried to include some of the discussed functions and tried to >>>> make them very fast. >>> >>> I don't think they are sufficiently general as they are limited to 2 >>> dimensions. However, I think the moving filters should go into scipy, either >>> in ndimage or maybe signals. Some of the others we can still speed of >>> significantly, for instance nanmedian, by using the new functionality in >>> numpy, i.e., numpy sort has worked with nans for a while now. It looks like >>> call overhead dominates the nanmax times for small arrays and this might >>> improve if the ufunc machinery is cleaned up a bit more, I don't know how >>> far Mark got with that. >> >> Currently Bottleneck accelerates 1d, 2d, and 3d input. Anything else >> falls back to a slower, non-cython version of the function. The same >> goes for int32, int64, float32, float64. >> >> It should not be difficult to extend to higher nd and more dtypes >> since everything is generated from template. The problem is that there >> would be a LOT of cython auto-generated C code since there is a >> separate function for each ndim, dtype, axis combination. >> >> Each of the ndim, dtype, axis functions currently has its own copy of >> the algorithm (such as median). Pulling that out and reusing it should >> save a lot of trees by reducing the auto-generated C code size. >> >> I recently added a partsort and argpartsort. 
>> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> From robert.kern at gmail.com Wed Jun 1 12:01:37 2011 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 1 Jun 2011 11:01:37 -0500 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka wrote: > would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. ?That way > > np.mean(np.zeros([4000,4000],'f4')+500) > > would not equal 511.493408? Yes, I object. You can set the accumulator dtype explicitly if you need it: np.mean(arr, dtype=np.float64) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From bsouthey at gmail.com Wed Jun 1 12:11:06 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 01 Jun 2011 11:11:06 -0500 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: <4DE6649A.2020204@gmail.com> On 06/01/2011 11:01 AM, Robert Kern wrote: > On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka wrote: >> would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. That way >> >> np.mean(np.zeros([4000,4000],'f4')+500) >> >> would not equal 511.493408? > Yes, I object. You can set the accumulator dtype explicitly if you > need it: np.mean(arr, dtype=np.float64) > Sure but fails to address that the output dtype of mean in this case is np.float64 which one would naively assume is also np.float64: >>> np.mean(np.zeros([4000,4000],'f4')+500).dtype dtype('float64') Thus, we have: Tickets 465 and 518 http://projects.scipy.org/numpy/ticket/465 http://projects.scipy.org/numpy/ticket/518 Various threads as well such as: http://mail.scipy.org/pipermail/numpy-discussion/2010-December/054253.html Bruce From robert.kern at gmail.com Wed Jun 1 12:25:59 2011 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 1 Jun 2011 11:25:59 -0500 Subject: [Numpy-discussion] New functions. In-Reply-To: <4DE6649A.2020204@gmail.com> References: <4DE6649A.2020204@gmail.com> Message-ID: On Wed, Jun 1, 2011 at 11:11, Bruce Southey wrote: > On 06/01/2011 11:01 AM, Robert Kern wrote: >> On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka ?wrote: >>> would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. ?That way >>> >>> np.mean(np.zeros([4000,4000],'f4')+500) >>> >>> would not equal 511.493408? >> Yes, I object. You can set the accumulator dtype explicitly if you >> need it: np.mean(arr, dtype=np.float64) >> > Sure but fails to address that the output dtype of mean in this case is > np.float64 which one would naively assume is also np.float64: > ?>>> np.mean(np.zeros([4000,4000],'f4')+500).dtype > dtype('float64') Of course my statement fails to address that because that wasn't what I was responding to. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? 
-- Umberto Eco From craigyk at me.com Wed Jun 1 12:29:50 2011 From: craigyk at me.com (Craig Yoshioka) Date: Wed, 01 Jun 2011 09:29:50 -0700 Subject: [Numpy-discussion] New functions. In-Reply-To: <4DE6649A.2020204@gmail.com> References: <4DE6649A.2020204@gmail.com> Message-ID: yes, and its probably slower to boot. A quick benchmark on my computer shows that: a = np.zeros([4000,4000],'f4')+500 np.mean(a) takes 0.02 secs np.mean(a,dtype=np.float64) takes 0.1 secs np.mean(a.astype(np.float64)) takes 0.06 secs so casting the whole array is almost 40% faster than setting the accumulator type!, I would imagine having the type-casting being done on the fly during the mean computation would be even faster. On Jun 1, 2011, at 9:11 AM, Bruce Southey wrote: > On 06/01/2011 11:01 AM, Robert Kern wrote: >> On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka wrote: >>> would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. That way >>> >>> np.mean(np.zeros([4000,4000],'f4')+500) >>> >>> would not equal 511.493408? >> Yes, I object. You can set the accumulator dtype explicitly if you >> need it: np.mean(arr, dtype=np.float64) >> > Sure but fails to address that the output dtype of mean in this case is > np.float64 which one would naively assume is also np.float64: >>>> np.mean(np.zeros([4000,4000],'f4')+500).dtype > dtype('float64') > > Thus, we have: > Tickets 465 and 518 > http://projects.scipy.org/numpy/ticket/465 > http://projects.scipy.org/numpy/ticket/518 > > Various threads as well such as: > http://mail.scipy.org/pipermail/numpy-discussion/2010-December/054253.html > > Bruce > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsseabold at gmail.com Wed Jun 1 12:32:09 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Wed, 1 Jun 2011 12:32:09 -0400 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 11:31 AM, Mark Miller wrote: > I'd love to see something like a "count_unique" function included. The > numpy.unique function is handy, but it can be a little awkward to > efficiently go back and get counts of each unique value after the > fact. > Does bincount do what you're after? [~/] [1]: x = np.random.randint(1,6,5) [~/] [2]: x [2]: array([1, 3, 4, 5, 1]) [~/] [3]: np.bincount(x) [3]: array([0, 2, 0, 1, 1, 1]) Skipper From markperrymiller at gmail.com Wed Jun 1 12:49:16 2011 From: markperrymiller at gmail.com (Mark Miller) Date: Wed, 1 Jun 2011 09:49:16 -0700 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: Not quite. Bincount is fine if you have a set of approximately sequential numbers. But if you don't.... >>> a = numpy.array((1,500,1000)) >>> a array([ 1, 500, 1000]) >>> b = numpy.bincount(a) >>> b array([0, 1, 0, ..., 0, 0, 1]) >>> len(b) 1001 -Mark On Wed, Jun 1, 2011 at 9:32 AM, Skipper Seabold wrote: > On Wed, Jun 1, 2011 at 11:31 AM, Mark Miller wrote: >> I'd love to see something like a "count_unique" function included. The >> numpy.unique function is handy, but it can be a little awkward to >> efficiently go back and get counts of each unique value after the >> fact. >> > > Does bincount do what you're after? 
> > [~/] > [1]: x = np.random.randint(1,6,5) > > [~/] > [2]: x > [2]: array([1, 3, 4, 5, 1]) > > [~/] > [3]: np.bincount(x) > [3]: array([0, 2, 0, 1, 1, 1]) > > Skipper > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From Chris.Barker at noaa.gov Wed Jun 1 13:30:20 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Wed, 01 Jun 2011 10:30:20 -0700 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: <4DE6772C.1060802@noaa.gov> On 5/31/11 6:08 PM, Charles R Harris wrote: > 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. so: In [53]: a Out[53]: array([ 1., 2., nan]) In [54]: b Out[54]: array([0, 1, 2]) In [55]: a + b Out[55]: array([ 1., 3., nan]) and nanadd(a,b) would yield: array([ 1., 3., 2.) I don't see how that is particularly useful, at least not any more useful that nanprod, nandiv, etc, etc... What am I missing? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From robert.kern at gmail.com Wed Jun 1 13:36:30 2011 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 1 Jun 2011 12:36:30 -0500 Subject: [Numpy-discussion] New functions. In-Reply-To: <4DE6772C.1060802@noaa.gov> References: <4DE6772C.1060802@noaa.gov> Message-ID: On Wed, Jun 1, 2011 at 12:30, Christopher Barker wrote: > On 5/31/11 6:08 PM, Charles R Harris wrote: >> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. > > so: > > In [53]: a > Out[53]: array([ ?1., ? 2., ?nan]) > > In [54]: b > Out[54]: array([0, 1, 2]) > > In [55]: a + b > Out[55]: array([ ?1., ? 3., ?nan]) > > and nanadd(a,b) would yield: > > array([ ?1., ? 3., ?2.) > > I don't see how that is particularly useful, at least not any more > useful that nanprod, nandiv, etc, etc... > > What am I missing? It's mostly useful for its .reduce() and .accumulate() form. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From josef.pktd at gmail.com Wed Jun 1 15:09:36 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 1 Jun 2011 15:09:36 -0400 Subject: [Numpy-discussion] New functions. In-Reply-To: References: <4DE6772C.1060802@noaa.gov> Message-ID: My favorite missing extension to numpy functions np.bincount with 2 (or more) dimensional weights for fast calculation of group statistics. Josef From mwwiebe at gmail.com Wed Jun 1 16:05:12 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 1 Jun 2011 15:05:12 -0500 Subject: [Numpy-discussion] fixing up datetime Message-ID: Hey all, So I'm doing a summer internship at Enthought, and the first thing they asked me to look into is finishing the datetime type in numpy. It turns out that the estimates of how complete the type was weren't accurate, and to support what the NEP describes required generalizing the ufunc type resolution system. I also found that the date/time parsing code (based on mxDateTime) was not robust, producing something for almost any arbitrary garbage input. I've replaced much of the broken code and implemented a lot of the functionality, and thought this might be a good point to do a pull request on what I've got and get feedback on the issues I've run into. 
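For anyone who hasn't tried the type yet, here is a rough sketch of the kind of ISO 8601-based usage under discussion (illustrative only; the exact parsing, unit and arithmetic behaviour is precisely what is still being settled in this thread, and it differs between NumPy versions):

import numpy as np

# Illustrative sketch: ISO 8601-style strings parsed into datetime64 values
# with an explicit day unit, plus simple timedelta64 arithmetic.
dates = np.array(['1970-01-01', '2011-06-01'], dtype='datetime64[D]')
span = dates[1] - dates[0]                  # a timedelta64 in days
next_week = dates + np.timedelta64(7, 'D')  # shift both dates by one week

The open questions below are mostly about the corner cases of exactly this kind of parsing and arithmetic.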
* The existing datetime-related API is probably not useful, and in fact those functions aren't used internally anymore. Is it reasonable to remove the functions, or do we just deprecate them? * Leap seconds probably deserve a rigorous treatment, but having an internal representation with leap-seconds overcomplicates otherwise very simple and fast operations. Could we internally use a value matching TAI or GPS time? Currently it's a UTC time in the present, but the 1970 epoch is then not the UTC 1970 epoch, but 10s of seconds off, and this isn't properly specified. What are people's opinions? The Python datetime.datetime doesn't support leap seconds (seconds == 60 is disallowed). * Default conversion to string - should it be in UTC or with the local timezone baked in? As UTC it may be confusing because 'now' will print as a different time than people would expect. * Business days - The existing business idea doesn't seem very useful, representing just the western M-F work week and not accounting for holidays. I've come up with a design which might address these issues: Extend the metadata for business days with a string identifier, like 'M8[B:USA]', then have a global internal dictionary which maps 'USA' to a workweek mask and a list of holidays. The call to prepare this dictionary for a particular business day type might look like np.set_business_days('USA', [1, 1, 1, 1, 1, 0, 0], np.array([ list of US holidays ], dtype='M8[D]')). Internally, business days would be stored the same as regular days, but with special treatment where landing on a weekend or holiday gives you back a NaT. If you are interested in the business day functionality, please comment on this design! * The dtype constructor accepted 'O#' for object types, something I think was wrong. I've removed that, but allow # to be 4 or 8, producing a deprecation warning if it occurs. * Would it make sense to offset the week-based datetime's epoch so it aligns with ISO 8601's week format? Jan 1, 1970 is a thursday, but the YYYY-Www date format uses weeks starting on monday. I think producing strings in this format when the datetime has units of weeks would be a natural thing to do. * Should the NaT (not-a-time) value behave like floating-point NaN? i.e. NaT == NaT return false, etc. Should operations generating NaT trigger an 'invalid' floating point exception in ufuncs? Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 1 16:52:08 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 1 Jun 2011 14:52:08 -0600 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 2:05 PM, Mark Wiebe wrote: > Hey all, > > So I'm doing a summer internship at Enthought, and the first thing they > asked me to look into is finishing the datetime type in numpy. It turns out > that the estimates of how complete the type was weren't accurate, and to > support what the NEP describes required generalizing the ufunc type > resolution system. I also found that the date/time parsing code (based on > mxDateTime) was not robust, producing something for almost any arbitrary > garbage input. I've replaced much of the broken code and implemented a lot > of the functionality, and thought this might be a good point to do a pull > request on what I've got and get feedback on the issues I've run into. 
> > * The existing datetime-related API is probably not useful, and in fact > those functions aren't used internally anymore. Is it reasonable to remove > the functions, or do we just deprecate them? > > * Leap seconds probably deserve a rigorous treatment, but having an > internal representation with leap-seconds overcomplicates otherwise very > simple and fast operations. Could we internally use a value matching TAI or > GPS time? Currently it's a UTC time in the present, but the 1970 epoch is > then not the UTC 1970 epoch, but 10s of seconds off, and this isn't properly > specified. What are people's opinions? The Python datetime.datetime doesn't > support leap seconds (seconds == 60 is disallowed). > > * Default conversion to string - should it be in UTC or with the local > timezone baked in? As UTC it may be confusing because 'now' will print as a > different time than people would expect. > > * Business days - The existing business idea doesn't seem very useful, > representing just the western M-F work week and not accounting for holidays. > I've come up with a design which might address these issues: Extend the > metadata for business days with a string identifier, like 'M8[B:USA]', then > have a global internal dictionary which maps 'USA' to a workweek mask and a > list of holidays. The call to prepare this dictionary for a particular > business day type might look like np.set_business_days('USA', [1, 1, 1, 1, > 1, 0, 0], np.array([ list of US holidays ], dtype='M8[D]')). Internally, > business days would be stored the same as regular days, but with special > treatment where landing on a weekend or holiday gives you back a NaT. > > If you are interested in the business day functionality, please comment on > this design! > > * The dtype constructor accepted 'O#' for object types, something I think > was wrong. I've removed that, but allow # to be 4 or 8, producing a > deprecation warning if it occurs. > > * Would it make sense to offset the week-based datetime's epoch so it > aligns with ISO 8601's week format? Jan 1, 1970 is a thursday, but the > YYYY-Www date format uses weeks starting on monday. I think producing > strings in this format when the datetime has units of weeks would be a > natural thing to do. > > * Should the NaT (not-a-time) value behave like floating-point NaN? i.e. > NaT == NaT return false, etc. Should operations generating NaT trigger an > 'invalid' floating point exception in ufuncs? > > Just a quick comment, as this really needs more thought, but time is a bag of worms. Trying to represent some standard -- say seconds at the solar system barycenter to account for general relativity -- is something that I think is too complicated and specialized to put into numpy. Good support for units and delta times is very useful, but parsing dates and times and handling timezones, daylight savings, leap seconds, business days, etc., is probably best served by addon packages specialized to an area of interest. Just my $.02 Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Wed Jun 1 17:01:09 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Wed, 1 Jun 2011 23:01:09 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 10:05 PM, Mark Wiebe wrote: > Hey all, > > So I'm doing a summer internship at Enthought, and the first thing they > asked me to look into is finishing the datetime type in numpy. 
It turns out > that the estimates of how complete the type was weren't accurate, and to > support what the NEP describes required generalizing the ufunc type > resolution system. I also found that the date/time parsing code (based on > mxDateTime) was not robust, producing something for almost any arbitrary > garbage input. I've replaced much of the broken code and implemented a lot > of the functionality, and thought this might be a good point to do a pull > request on what I've got and get feedback on the issues I've run into. > > * The existing datetime-related API is probably not useful, and in fact > those functions aren't used internally anymore. Is it reasonable to remove > the functions, or do we just deprecate them? > If the existing API is really not useful (which requires some discussion/review I guess) then I think it would be good to announce that on the mailing list and throw it out ASAP. The API can't have many users yet, since it has only just been released (again), so the sooner it's gone the better. I know normal policy would be to deprecate first, but I don't really see the point. And after the removal of datetime from 1.4.1 and now this, I'd be in favor of putting a large "experimental" sticker over the whole thing until further notice. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 1 17:04:18 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 1 Jun 2011 15:04:18 -0600 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 3:01 PM, Ralf Gommers wrote: > > > On Wed, Jun 1, 2011 at 10:05 PM, Mark Wiebe wrote: > >> Hey all, >> >> So I'm doing a summer internship at Enthought, and the first thing they >> asked me to look into is finishing the datetime type in numpy. It turns out >> that the estimates of how complete the type was weren't accurate, and to >> support what the NEP describes required generalizing the ufunc type >> resolution system. I also found that the date/time parsing code (based on >> mxDateTime) was not robust, producing something for almost any arbitrary >> garbage input. I've replaced much of the broken code and implemented a lot >> of the functionality, and thought this might be a good point to do a pull >> request on what I've got and get feedback on the issues I've run into. >> >> * The existing datetime-related API is probably not useful, and in fact >> those functions aren't used internally anymore. Is it reasonable to remove >> the functions, or do we just deprecate them? >> > > If the existing API is really not useful (which requires some > discussion/review I guess) then I think it would be good to announce that on > the mailing list and throw it out ASAP. The API can't have many users yet, > since it has only just been released (again), so the sooner it's gone the > better. I know normal policy would be to deprecate first, but I don't really > see the point. > > +1 > And after the removal of datetime from 1.4.1 and now this, I'd be in favor > of putting a large "experimental" sticker over the whole thing until further > notice. > > Do we have a good way to do that? Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at googlemail.com Wed Jun 1 17:10:55 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Wed, 1 Jun 2011 23:10:55 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 11:04 PM, Charles R Harris wrote: > > > On Wed, Jun 1, 2011 at 3:01 PM, Ralf Gommers wrote: > >> >> >> On Wed, Jun 1, 2011 at 10:05 PM, Mark Wiebe wrote: >> >>> Hey all, >>> >>> So I'm doing a summer internship at Enthought, and the first thing they >>> asked me to look into is finishing the datetime type in numpy. It turns out >>> that the estimates of how complete the type was weren't accurate, and to >>> support what the NEP describes required generalizing the ufunc type >>> resolution system. I also found that the date/time parsing code (based on >>> mxDateTime) was not robust, producing something for almost any arbitrary >>> garbage input. I've replaced much of the broken code and implemented a lot >>> of the functionality, and thought this might be a good point to do a pull >>> request on what I've got and get feedback on the issues I've run into. >>> >>> * The existing datetime-related API is probably not useful, and in fact >>> those functions aren't used internally anymore. Is it reasonable to remove >>> the functions, or do we just deprecate them? >>> >> >> If the existing API is really not useful (which requires some >> discussion/review I guess) then I think it would be good to announce that on >> the mailing list and throw it out ASAP. The API can't have many users yet, >> since it has only just been released (again), so the sooner it's gone the >> better. I know normal policy would be to deprecate first, but I don't really >> see the point. >> >> > +1 > > >> And after the removal of datetime from 1.4.1 and now this, I'd be in favor >> of putting a large "experimental" sticker over the whole thing until further >> notice. >> >> > Do we have a good way to do that? > > Not a perfect one, but if there would be a prominent note in all related docs that would help a lot already. A more intrusive way would be to raise warnings. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 1 17:16:06 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 1 Jun 2011 16:16:06 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris wrote: > > > Just a quick comment, as this really needs more thought, but time is a bag > of worms. > Certainly a bag of worms, I agree. > Trying to represent some standard -- say seconds at the solar system > barycenter to account for general relativity -- is something that I think is > too complicated and specialized to put into numpy. > > Good support for units and delta times is very useful, > This part works fairly well now, except for some questions like what should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe "2011-02-28", or "2011-03-02"? > but parsing dates and times > I've implemented an almost-ISO 8601 date-time parser. I had to deviate a bit to support years outside the little 10000-year window we use. I think supporting more formats could be handled by writing a function which takes its date format and outputs ISO 8601 format to feed numpy. > and handling timezones, daylight savings, > The only timezone to handle is "local", which it looks like standard C has a rich enough API for this to be ok. 
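As a rough illustration of that point, using Python's thin wrappers around the C library rather than anything NumPy-specific, here is the same epoch instant rendered once in UTC and once in the machine's local timezone:

import time

epoch = 0  # 1970-01-01T00:00:00 UTC
print(time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(epoch)))        # UTC
print(time.strftime('%Y-%m-%d %H:%M:%S %Z', time.localtime(epoch)))  # local time, per the C library's TZ handling

Whether the default string conversion should look like the first or the second line is the UTC-versus-local question raised earlier.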
> leap seconds, > This one can be mitigated by referencing TAI or GPS time, I think. POSIX time looks like a can of worms though. > business days, > I think Enthought has some interest in a decent business day functionality. Some feedback from people doing time series and financial modeling would help clarify this though. This link may provide some context: http://www.financialcalendar.com/Data/Holidays/Formats -Mark > etc., is probably best served by addon packages specialized to an area of > interest. Just my $.02 > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 1 17:27:12 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 1 Jun 2011 16:27:12 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 4:01 PM, Ralf Gommers wrote: > > > On Wed, Jun 1, 2011 at 10:05 PM, Mark Wiebe wrote: > >> Hey all, >> >> So I'm doing a summer internship at Enthought, and the first thing they >> asked me to look into is finishing the datetime type in numpy. It turns out >> that the estimates of how complete the type was weren't accurate, and to >> support what the NEP describes required generalizing the ufunc type >> resolution system. I also found that the date/time parsing code (based on >> mxDateTime) was not robust, producing something for almost any arbitrary >> garbage input. I've replaced much of the broken code and implemented a lot >> of the functionality, and thought this might be a good point to do a pull >> request on what I've got and get feedback on the issues I've run into. >> >> * The existing datetime-related API is probably not useful, and in fact >> those functions aren't used internally anymore. Is it reasonable to remove >> the functions, or do we just deprecate them? >> > > If the existing API is really not useful (which requires some > discussion/review I guess) then I think it would be good to announce that on > the mailing list and throw it out ASAP. The API can't have many users yet, > since it has only just been released (again), so the sooner it's gone the > better. I know normal policy would be to deprecate first, but I don't really > see the point. > Here's a quick rundown: PyArray_SetDatetimeParseFunction: The NEP indicated ISO 8601 format for input, and I think a function which converts arbitrary input into ISO 8601 format is better than setting a function inside NumPy itself. In my current branch, this function does nothing. PyArray_DatetimeToDatetimeStruct, and the 3 other similar ones: These functions only take the base unit for conversion, not the full metadata. To properly do the conversions, you also need the multiplier and the number of events, and if the extended business day idea I described is implemented, that workday/holiday data. I've made equivalent internal functions that take the metadata as the parameter instead, but I'm hesitant to expose public API until things are baked a bit more. And after the removal of datetime from 1.4.1 and now this, I'd be in favor > of putting a large "experimental" sticker over the whole thing until further > notice. 
> +1 -Mark > > Ralf > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 1 17:29:31 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 1 Jun 2011 15:29:31 -0600 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe wrote: > On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> >> Just a quick comment, as this really needs more thought, but time is a bag >> of worms. >> > > Certainly a bag of worms, I agree. > > >> Trying to represent some standard -- say seconds at the solar system >> barycenter to account for general relativity -- is something that I think is >> too complicated and specialized to put into numpy. >> > >> > Good support for units and delta times is very useful, >> > > This part works fairly well now, except for some questions like what should > datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe "2011-02-28", > or "2011-03-02"? > > >> but parsing dates and times >> > > I've implemented an almost-ISO 8601 date-time parser. I had to deviate a > bit to support years outside the little 10000-year window we use. I think > supporting more formats could be handled by writing a function which takes > its date format and outputs ISO 8601 format to feed numpy. > Heh, 10000 years is way outside the window in which the Gregorian calender is useful anyway. Now, strict Julian days start at the creation of the world so that's a real standard, sort of like degrees Kelvins ;) > and handling timezones, daylight savings, >> > > The only timezone to handle is "local", which it looks like standard C has > a rich enough API for this to be ok. > > >> leap seconds, >> > > This one can be mitigated by referencing TAI or GPS time, I think. POSIX > time looks like a can of worms though. > > >> business days, >> > > I think Enthought has some interest in a decent business day functionality. > Some feedback from people doing time series and financial modeling would > help clarify this though. This link may provide some context: > > http://www.financialcalendar.com/Data/Holidays/Formats > > Well, that can change. Likewise, who wants to track ERS for the leap seconds? I think the important thing is that the low level support needed to implement such things is available. A package for common use cases that serves as an example and that can be extended/modified by others would also be helpful. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Wed Jun 1 19:47:38 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 2 Jun 2011 00:47:38 +0100 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 10:29 PM, Charles R Harris wrote: > > > On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe wrote: >> >> On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris >> wrote: >>> >>> >>> Just a quick comment, as this really needs more thought, but time is a >>> bag of worms. >> >> Certainly a bag of worms, I agree. >> >>> >>> Trying to represent some standard -- say seconds at the solar system >>> barycenter to account for general relativity -- is something that I think is >>> too complicated and specialized to put into numpy. 
>>> >>> >>> >>> Good support for units and delta times is very useful, >> >> This part works fairly well now, except for some questions like what >> should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe >> "2011-02-28", or "2011-03-02"? >> >>> >>> but parsing dates and times >> >> I've implemented an almost-ISO 8601 date-time parser. ?I had to deviate a >> bit to support years outside the little 10000-year window we use. I think >> supporting more formats could be handled by writing a function which takes >> its date format and outputs ISO 8601 format to feed numpy. > > Heh, 10000 years is way outside the window in which the Gregorian calender > is useful anyway. Now, strict Julian days start at the creation of the world > so that's a real standard, sort of like degrees Kelvins ;) > >>> >>> and handling timezones,?daylight savings, >> >> The only timezone to handle is "local", which it looks like standard C has >> a rich enough API for this to be ok. >> >>> >>> leap seconds, >> >> This one can be mitigated by referencing TAI or GPS time, I think. POSIX >> time looks like a can of worms though. >> >>> >>> business days, >> >> I think Enthought has some interest in a decent business day >> functionality. Some feedback from people doing time series and financial >> modeling would help clarify this though. This link may provide some context: >> http://www.financialcalendar.com/Data/Holidays/Formats > > Well, that can change. Likewise, who wants to track ERS for the leap > seconds? > > I think the important thing is that the low level support needed to > implement such things is available. A package for common use cases that > serves as an example and that can be extended/modified by others would also > be helpful. > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > Being involved in financial data analysis (e.g. via pandas) this stuff will affect me very directly. I'll get back to you soon with some comments. - Wes From cournape at gmail.com Wed Jun 1 21:35:45 2011 From: cournape at gmail.com (David Cournapeau) Date: Thu, 2 Jun 2011 10:35:45 +0900 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller wrote: > Not quite. Bincount is fine if you have a set of approximately > sequential numbers. But if you don't.... Even worse, it fails miserably if you sequential numbers but with a high shift. np.bincount([100000001, 100000002]) # will take a lof of memory Doing bincount with dict is faster in those cases. cheers, David From alan.isaac at gmail.com Wed Jun 1 22:10:02 2011 From: alan.isaac at gmail.com (Alan G Isaac) Date: Wed, 01 Jun 2011 22:10:02 -0400 Subject: [Numpy-discussion] bincount limitations In-Reply-To: References: Message-ID: <4DE6F0FA.4070805@gmail.com> > On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller wrote: >> Not quite. Bincount is fine if you have a set of approximately >> sequential numbers. But if you don't.... On 6/1/2011 9:35 PM, David Cournapeau wrote: > Even worse, it fails miserably if you sequential numbers but with a high shift. > np.bincount([100000001, 100000002]) # will take a lof of memory > Doing bincount with dict is faster in those cases. Since this discussion has turned shortcomings of bincount, may I ask why np.bincount([]) is not an empty array? Even more puzzling, why is np.bincount([],minlength=6) not a 6-array of zeros? 
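A minimal workaround sketch for the behaviour asked about here (the safe_bincount helper below is hypothetical, not a NumPy function):

import numpy as np

def safe_bincount(x, minlength=0):
    # Hypothetical helper: return zero counts for empty input instead of
    # raising, and pad the result out to minlength.
    x = np.asarray(x, dtype=np.intp)
    if x.size == 0:
        return np.zeros(minlength, dtype=np.intp)
    counts = np.bincount(x)
    if counts.size < minlength:
        pad = np.zeros(minlength - counts.size, dtype=counts.dtype)
        counts = np.concatenate([counts, pad])
    return counts

safe_bincount([], minlength=6)  # -> array([0, 0, 0, 0, 0, 0])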
Use case: bincount of infected individuals by number of contacts. (In some periods there may be no infections.) Thank you, Alan Isaac PS A collections.Counter works pretty nice for Mark and David's cases, aside from the fact that the keys are not sorted. From oliphant at enthought.com Wed Jun 1 23:06:19 2011 From: oliphant at enthought.com (Travis Oliphant) Date: Wed, 1 Jun 2011 22:06:19 -0500 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On May 31, 2011, at 8:08 PM, Charles R Harris wrote: > Hi All, > > I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest. > > 1) Modified sort/argsort functions that return the maximum k values. > This is easy to do with heapsort and almost as easy with mergesort. +1 > > 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible. +0 --- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling. This needs more discussion, but useful to mention in this context. > > 3) Fast medians. Definitely order-statistics would be the right thing to handle generally. -Travis > > > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion --- Travis Oliphant Enthought, Inc. oliphant at enthought.com 1-512-536-1057 http://www.enthought.com From josef.pktd at gmail.com Thu Jun 2 04:42:40 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 2 Jun 2011 04:42:40 -0400 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 9:35 PM, David Cournapeau wrote: > On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller wrote: >> Not quite. Bincount is fine if you have a set of approximately >> sequential numbers. But if you don't.... > > Even worse, it fails miserably if you sequential numbers but with a high shift. > > np.bincount([100000001, 100000002]) # will take a lof of memory > > Doing bincount with dict is faster in those cases. same with negative numbers, but in these cases I just subtract the min and we are back to the fast bincount case cheers, Josef > > cheers, > > David > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From butterw at gmail.com Thu Jun 2 04:44:34 2011 From: butterw at gmail.com (Peter Butterworth) Date: Thu, 2 Jun 2011 10:44:34 +0200 Subject: [Numpy-discussion] np.histogram: upper range bin Message-ID: in np.histogram the top-most bin edge is inclusive of the upper range limit. As documented in the docstring (see below) this is actually the expected behavior, but this can lead to some weird enough results: In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) Out[72]: (array([1, 1, 2]), array([ 1., 2., 3., 4.])) Is there any way round this or an alternative implementation without this issue ? Currently it seems you have to explicitly specify a range with an extra bin at the high end: In [77]: x=[1, 2, 3, 4]; np.histogram(x, bins=4, range=[1, 5]) Out[77]: (array([1, 1, 1, 1]), array([ 1., 2., 3., 4., 5.])) ''' Notes ----- All but the last (righthand-most) bin is half-open. 
In other words, if `bins` is:: [1, 2, 3, 4] then the first bin is ``[1, 2)`` (including 1, but excluding 2) and the second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes* 4. ''' -- thanks, peter butterworth From cournape at gmail.com Thu Jun 2 05:17:54 2011 From: cournape at gmail.com (David Cournapeau) Date: Thu, 2 Jun 2011 18:17:54 +0900 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 5:42 PM, wrote: > On Wed, Jun 1, 2011 at 9:35 PM, David Cournapeau wrote: >> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller wrote: >>> Not quite. Bincount is fine if you have a set of approximately >>> sequential numbers. But if you don't.... >> >> Even worse, it fails miserably if you sequential numbers but with a high shift. >> >> np.bincount([100000001, 100000002]) # will take a lof of memory >> >> Doing bincount with dict is faster in those cases. > > same with negative numbers, but in these cases I just subtract the min > and we are back to the fast bincount case Indeed, you can also deal with large numbers which are not consecutive by using a lookup-table. All those methods are quite error-prone in general, and reallly, there is no reason why bincount could not handle the general case, cheers, David From marc.shivers at gmail.com Thu Jun 2 08:25:06 2011 From: marc.shivers at gmail.com (Marc Shivers) Date: Thu, 2 Jun 2011 08:25:06 -0400 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: Regarding business days, for financial data,?trading days are determined by each exchange, so, in particular, there is no such thing as a US Trading Calendar, only a NYSE Calendar, NYMEX Calendar, etc, etc... . ?I think it would be useful to have functionality where you could start with a weekday calendar with specified normal trading hours and easily remove a list of non-trading days that the user specifies as well as modify the days where the trading hours are shorter (e.g. the day after thanksgiving). I think anything more specific should probably be relegated to a specific package for financial calendars, as this entire topic gets very messy very fast. -Marc On Wed, Jun 1, 2011 at 7:47 PM, Wes McKinney wrote: > On Wed, Jun 1, 2011 at 10:29 PM, Charles R Harris > wrote: >> >> >> On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe wrote: >>> >>> On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris >>> wrote: >>>> >>>> >>>> Just a quick comment, as this really needs more thought, but time is a >>>> bag of worms. >>> >>> Certainly a bag of worms, I agree. +1 >>> >>>> >>>> Trying to represent some standard -- say seconds at the solar system >>>> barycenter to account for general relativity -- is something that I think is >>>> too complicated and specialized to put into numpy. >>>> >>>> >>>> >>>> Good support for units and delta times is very useful, >>> >>> This part works fairly well now, except for some questions like what >>> should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe >>> "2011-02-28", or "2011-03-02"? >>> >>>> >>>> but parsing dates and times >>> >>> I've implemented an almost-ISO 8601 date-time parser. ?I had to deviate a >>> bit to support years outside the little 10000-year window we use. I think >>> supporting more formats could be handled by writing a function which takes >>> its date format and outputs ISO 8601 format to feed numpy. >> >> Heh, 10000 years is way outside the window in which the Gregorian calender >> is useful anyway. 
Now, strict Julian days start at the creation of the world >> so that's a real standard, sort of like degrees Kelvins ;) >> >>>> >>>> and handling timezones,?daylight savings, >>> >>> The only timezone to handle is "local", which it looks like standard C has >>> a rich enough API for this to be ok. >>> >>>> >>>> leap seconds, >>> >>> This one can be mitigated by referencing TAI or GPS time, I think. POSIX >>> time looks like a can of worms though. >>> >>>> >>>> business days, >>> >>> I think Enthought has some interest in a decent business day >>> functionality. Some feedback from people doing time series and financial >>> modeling would help clarify this though. This link may provide some context: >>> http://www.financialcalendar.com/Data/Holidays/Formats >> >> Well, that can change. Likewise, who wants to track ERS for the leap >> seconds? >> >> I think the important thing is that the low level support needed to >> implement such things is available. A package for common use cases that >> serves as an example and that can be extended/modified by others would also >> be helpful. >> >> Chuck >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > Being involved in financial data analysis (e.g. via pandas) this stuff > will affect me very directly. I'll get back to you soon with some > comments. > > - Wes > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From mwwiebe at gmail.com Thu Jun 2 10:16:16 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 09:16:16 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 7:25 AM, Marc Shivers wrote: > Regarding business days, for financial data, trading days are > determined by each exchange, so, in particular, there is no such thing > as a US Trading Calendar, only a NYSE Calendar, NYMEX Calendar, etc, > etc... . I think it would be useful to have functionality where you > could start with a weekday calendar with specified normal trading > hours and easily remove a list of non-trading days that the user > specifies as well as modify the days where the trading hours are > shorter (e.g. the day after thanksgiving). > What kind of functions/data would be needed for days with shorter trading hours? The design I've sketched out supports everything you're describing except things related to trading hours. > > I think anything more specific should probably be relegated to a > specific package for financial calendars, as this entire topic gets > very messy very fast. > I definitely agree, NumPy would have no knowledge of specifics, just a mechanism to tell it when holidays are, etc, and do reasonable calculations on them. -Mark > > -Marc > > On Wed, Jun 1, 2011 at 7:47 PM, Wes McKinney wrote: > > On Wed, Jun 1, 2011 at 10:29 PM, Charles R Harris > > wrote: > >> > >> > >> On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe wrote: > >>> > >>> On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris > >>> wrote: > >>>> > >>>> > >>>> Just a quick comment, as this really needs more thought, but time is a > >>>> bag of worms. > >>> > >>> Certainly a bag of worms, I agree. 
> > +1 > > >>> > >>>> > >>>> Trying to represent some standard -- say seconds at the solar system > >>>> barycenter to account for general relativity -- is something that I > think is > >>>> too complicated and specialized to put into numpy. > >>>> > >>>> > >>>> > >>>> Good support for units and delta times is very useful, > >>> > >>> This part works fairly well now, except for some questions like what > >>> should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe > >>> "2011-02-28", or "2011-03-02"? > >>> > >>>> > >>>> but parsing dates and times > >>> > >>> I've implemented an almost-ISO 8601 date-time parser. I had to deviate > a > >>> bit to support years outside the little 10000-year window we use. I > think > >>> supporting more formats could be handled by writing a function which > takes > >>> its date format and outputs ISO 8601 format to feed numpy. > >> > >> Heh, 10000 years is way outside the window in which the Gregorian > calender > >> is useful anyway. Now, strict Julian days start at the creation of the > world > >> so that's a real standard, sort of like degrees Kelvins ;) > >> > >>>> > >>>> and handling timezones, daylight savings, > >>> > >>> The only timezone to handle is "local", which it looks like standard C > has > >>> a rich enough API for this to be ok. > >>> > >>>> > >>>> leap seconds, > >>> > >>> This one can be mitigated by referencing TAI or GPS time, I think. > POSIX > >>> time looks like a can of worms though. > >>> > >>>> > >>>> business days, > >>> > >>> I think Enthought has some interest in a decent business day > >>> functionality. Some feedback from people doing time series and > financial > >>> modeling would help clarify this though. This link may provide some > context: > >>> http://www.financialcalendar.com/Data/Holidays/Formats > >> > >> Well, that can change. Likewise, who wants to track ERS for the leap > >> seconds? > >> > >> I think the important thing is that the low level support needed to > >> implement such things is available. A package for common use cases that > >> serves as an example and that can be extended/modified by others would > also > >> be helpful. > >> > >> Chuck > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> > > > > Being involved in financial data analysis (e.g. via pandas) this stuff > > will affect me very directly. I'll get back to you soon with some > > comments. > > > > - Wes > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.shivers at gmail.com Thu Jun 2 10:57:48 2011 From: marc.shivers at gmail.com (Marc Shivers) Date: Thu, 2 Jun 2011 10:57:48 -0400 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 10:16 AM, Mark Wiebe wrote: > On Thu, Jun 2, 2011 at 7:25 AM, Marc Shivers wrote: >> >> Regarding business days, for financial data,?trading days are >> determined by each exchange, so, in particular, there is no such thing >> as a US Trading Calendar, only a NYSE Calendar, NYMEX Calendar, etc, >> etc... . 
?I think it would be useful to have functionality where you >> could start with a weekday calendar with specified normal trading >> hours and easily remove a list of non-trading days that the user >> specifies as well as modify the days where the trading hours are >> shorter (e.g. the day after thanksgiving). > > What kind of functions/data would be needed for days with shorter trading > hours? The design I've sketched out supports everything you're describing > except things related to trading hours. Now that I'm thinking about it, maybe a good way to deal with this would be to just build two calendars, one for normal trading hours, and one for short trading hours, and then just stack and sort them, so maybe there's no need for a special method... >> >> I think anything more specific should probably be relegated to a >> specific package for financial calendars, as this entire topic gets >> very messy very fast. > > I definitely agree, NumPy would have no knowledge of specifics, just a > mechanism to tell it when holidays are, etc, and do reasonable calculations > on them. > -Mark >> >> -Marc >> >> On Wed, Jun 1, 2011 at 7:47 PM, Wes McKinney wrote: >> > On Wed, Jun 1, 2011 at 10:29 PM, Charles R Harris >> > wrote: >> >> >> >> >> >> On Wed, Jun 1, 2011 at 3:16 PM, Mark Wiebe wrote: >> >>> >> >>> On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris >> >>> wrote: >> >>>> >> >>>> >> >>>> Just a quick comment, as this really needs more thought, but time is >> >>>> a >> >>>> bag of worms. >> >>> >> >>> Certainly a bag of worms, I agree. >> >> +1 >> >> >>> >> >>>> >> >>>> Trying to represent some standard -- say seconds at the solar system >> >>>> barycenter to account for general relativity -- is something that I >> >>>> think is >> >>>> too complicated and specialized to put into numpy. >> >>>> >> >>>> >> >>>> >> >>>> Good support for units and delta times is very useful, >> >>> >> >>> This part works fairly well now, except for some questions like what >> >>> should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe >> >>> "2011-02-28", or "2011-03-02"? >> >>> >> >>>> >> >>>> but parsing dates and times >> >>> >> >>> I've implemented an almost-ISO 8601 date-time parser. ?I had to >> >>> deviate a >> >>> bit to support years outside the little 10000-year window we use. I >> >>> think >> >>> supporting more formats could be handled by writing a function which >> >>> takes >> >>> its date format and outputs ISO 8601 format to feed numpy. >> >> >> >> Heh, 10000 years is way outside the window in which the Gregorian >> >> calender >> >> is useful anyway. Now, strict Julian days start at the creation of the >> >> world >> >> so that's a real standard, sort of like degrees Kelvins ;) >> >> >> >>>> >> >>>> and handling timezones,?daylight savings, >> >>> >> >>> The only timezone to handle is "local", which it looks like standard C >> >>> has >> >>> a rich enough API for this to be ok. >> >>> >> >>>> >> >>>> leap seconds, >> >>> >> >>> This one can be mitigated by referencing TAI or GPS time, I think. >> >>> POSIX >> >>> time looks like a can of worms though. >> >>> >> >>>> >> >>>> business days, >> >>> >> >>> I think Enthought has some interest in a decent business day >> >>> functionality. Some feedback from people doing time series and >> >>> financial >> >>> modeling would help clarify this though. This link may provide some >> >>> context: >> >>> http://www.financialcalendar.com/Data/Holidays/Formats >> >> >> >> Well, that can change. 
Likewise, who wants to track ERS for the leap >> >> seconds? >> >> >> >> I think the important thing is that the low level support needed to >> >> implement such things is available. A package for common use cases that >> >> serves as an example and that can be extended/modified by others would >> >> also >> >> be helpful. >> >> >> >> Chuck >> >> >> >> _______________________________________________ >> >> NumPy-Discussion mailing list >> >> NumPy-Discussion at scipy.org >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >> >> >> > >> > Being involved in financial data analysis (e.g. via pandas) this stuff >> > will affect me very directly. I'll get back to you soon with some >> > comments. >> > >> > - Wes >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From pgmdevlist at gmail.com Thu Jun 2 11:36:20 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 2 Jun 2011 17:36:20 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Jun 1, 2011, at 11:16 PM, Mark Wiebe wrote: > On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris wrote: > > > Just a quick comment, as this really needs more thought, but time is a bag of worms. > > Certainly a bag of worms, I agree. Oh yes... Keep in mind that the inclusion of time in numpy was tricky at the beginning. Travis O. started to work on that at the same time a GSoC student Jared and I were supervising a couple of years ago. This student tried to implement in numpy some of the ideas Matt Knox and I had put in scikits.timeseries. I myself tried to port the new dtype to scikits.timeseries, but ran into problems in the C side ( I've never been able to completely subclass an array in C...), and then life went in the way. Anyhow. I would advise you to check the very experimental git version of scikits.timeseries on github, it can give you some ideas of what (not) to do. > This part works fairly well now, except for some questions like what should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe "2011-02-28", or "2011-03-02"? Great example. What rules do you have for adding dates and deltas of different frequency ? What's the frequency of the result? > > but parsing dates and times > > I've implemented an almost-ISO 8601 date-time parser. I had to deviate a bit to support years outside the little 10000-year window we use. I think supporting more formats could be handled by writing a function which takes its date format and outputs ISO 8601 format to feed numpy. Good to have a nice parser independent of eGenix's, but I don't think it should be the first priority. Other comments: * I've never been really happy with the idea of putting a nb of events in the dtype. I think it makes things more complicated than they already are, for no big advantages (if you really need to keep track of several events falling in the same unit of time, why don't you make several arrays ?). If there's a will to change the API, I'd be glad to see this part dropped... * I quite agree with Mark W. 
It'd be great if there could be some basic mechanism to handle holidays in numpy, for example using the metadata. It'd be up to the user to decide what a holiday is (or holihour...). Wouldn't it be too much overhead for most users, though ? And those holidays would affect __add__/__sub__ and ? * As Chuck pointed out, users in different fields won't have the same expectations. In environmental sciences, there's no need for leap seconds nor business days. Astronomers and financial analysts have different needs... From Chris.Barker at noaa.gov Thu Jun 2 12:19:00 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 02 Jun 2011 09:19:00 -0700 Subject: [Numpy-discussion] np.histogram: upper range bin In-Reply-To: References: Message-ID: <4DE7B7F4.9040705@noaa.gov> Peter Butterworth wrote: > in np.histogram the top-most bin edge is inclusive of the upper range > limit. As documented in the docstring (see below) this is actually the > expected behavior, but this can lead to some weird enough results: > > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) > Out[72]: (array([1, 1, 2]), array([ 1., 2., 3., 4.])) > > Is there any way round this or an alternative implementation without > this issue ? The way around it is what you've identified -- making sure your bins are right. But I think the current behavior is the way it "should" be. It keeps folks from inadvertently loosing stuff off the end -- the lower end is inclusive, so the upper end should be too. In the middle bins, one has to make an arbitrary cut-off, and put the values on the "line" somewhere. One thing to keep in mind is that, in general, histogram is designed for floating point numbers, not just integers -- counting integers can be accomplished other ways, if that's what you really want (see np.bincount). But back to your example: > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) Why do you want only 3 bins here? using 4 gives you what you want. If you want more control, then it seems you really want to know how many of each of the values 1,2,3,4 there are. so you want 4 bins, each *centered* on the integers, so you might do: In [8]: np.histogram(x, bins=4, range=(0.5, 4.5)) Out[8]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) or, if you want to be more explicit: In [14]: np.histogram(x, bins=np.linspace(0.5, 4.5, 5)) Out[14]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From Chris.Barker at noaa.gov Thu Jun 2 12:22:24 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 02 Jun 2011 09:22:24 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: <4DE7B8C0.4040805@noaa.gov> Charles R Harris wrote: > Good support for units and delta times is very useful, but > parsing dates and times and handling timezones, daylight savings, leap > seconds, business days, etc., is probably best served by addon packages > specialized to an area of interest. Just my $.02 I agree here -- I think for numpy, what's key is to focus on the kind of things needed for computational use -- that is the performance critical stuff. I suppose business-day type calculations would be both key and performance-critical, but that sure seems like the kind of thing that should go in an add-on package, rather than in numpy. 
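For what it's worth, an add-on like that wouldn't need much from numpy beyond
what's already there. A rough sketch of a vectorized weekday count -- the name
count_weekdays is made up, holidays are ignored, and no new dtype is involved,
so take it purely as an illustration:

import numpy as np
from datetime import date

def count_weekdays(start, end):
    # Mon-Fri days in the half-open range [start, end); start/end are datetime.date
    ndays = (end - start).days
    dow = (start.weekday() + np.arange(ndays)) % 7   # 0 = Monday
    return int((dow < 5).sum())

count_weekdays(date(2011, 6, 1), date(2011, 7, 1))   # -> 22 for June 2011

A holiday list or a particular exchange calendar could then be subtracted on
top of that by the add-on package.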
The stdlib datetime package is a little bit too small for my taste (couldn't I at least as for a TimeDelta to be expressed in, say, seconds, without doing any math on my own?), but the idea is good -- create the core types, let add-on packages do the more specialized stuff. > * The existing datetime-related API is probably not useful, and in fact > those functions aren't used internally anymore. Is it reasonable to > remove the functions, or do we just deprecate them? I say remove, but some polling to see if anyone is using it might be in order first. > * Leap seconds probably deserve a rigorous treatment, but having an > internal representation with leap-seconds overcomplicates otherwise very > simple and fast operations. could you explain more? I don't get the issues -- leap seconds would com e in for calculations like: a_given_datetime + a_timedelta, correct? Given leap years, and all the other ugliness, does leap seconds really make it worse? > * Default conversion to string - should it be in UTC or with the local > timezone baked in? most date_time handling should be time-zone neutral --i.e. assume everything is in the same timezone (and daylight savings status). Libs that assume you want the locale setting do nothing but cause major pain if you have anything out of the ordinary to do (and sometimes ordinary stuff, too). If you MUST include time-zone, exlicite is better than implicit -- have the user specify, or, at the very least make it easy for the user to override any defaults. > As UTC it may be confusing because 'now' will print > as a different time than people would expect. I think "now" should be expressed (but also stored) in the local time, unless the user asks for UTC. This is consistent with the std lib datetime.now(), if nothing else. > * Should the NaT (not-a-time) value behave like floating-point NaN? i.e. > NaT == NaT return false, etc. Should operations generating NaT trigger > an 'invalid' floating point exception in ufuncs? makes sense to me -- at least many folks are used to NaN symantics. > And after the removal of datetime from 1.4.1 and now this, I'd be in > favor of putting a large "experimental" sticker over the whole thing > until further notice. > > Do we have a good way to do that? Maybe a "experimental" warning, analogous to the "deprecation" warning. > Good support for units and delta times is very useful, > This part works fairly well now, except for some questions like what > should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe > "2011-02-28", or "2011-03-02"? Neither -- "month" should not be a valid unit to express a timedelta in. Nor should year, or anything else that is not clearly defined (we can argue about day, which does change a bit as the earth slows down, yes?) Yes, it's nice to be able to easily have a way of expressing things like every month, or "a month from now" when you mean a calendar month, but it's a heck of a can of worms. We just had a big discussion about this in the netcdf CF metadata standards list: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/007807.html We more or less came to the conclusion (I did, anyway) that there were two distinct, but related concepts: 1) time as a strict unit of measurement, like length, mass, etc. In that case, don't use "months" as a unit. 2) Calendars -- these are what months, days of week, etc, etc, etc. are from, and these get ugly. I also learned that there are even more calendars than I thought. 
Beyond the Julian, Gregorian, etc, there are special ones used for climate modeling and the like, that have nice properties like all months being 30 days long, etc. Plus, as discussed, various "business" calendars. So: I think that the calendar-related functions need fairly self contained library, with various classes for the various calendars one might want to use, and a well specified way to define new ones. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Thu Jun 2 12:22:32 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 11:22:32 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 10:36 AM, Pierre GM wrote: > > On Jun 1, 2011, at 11:16 PM, Mark Wiebe wrote: > > > On Wed, Jun 1, 2011 at 3:52 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > > > > > > Just a quick comment, as this really needs more thought, but time is a > bag of worms. > > > > Certainly a bag of worms, I agree. > > Oh yes... Keep in mind that the inclusion of time in numpy was tricky at > the beginning. Travis O. started to work on that at the same time a GSoC > student Jared and I were supervising a couple of years ago. This student > tried to implement in numpy some of the ideas Matt Knox and I had put in > scikits.timeseries. I myself tried to port the new dtype to > scikits.timeseries, but ran into problems in the C side ( I've never been > able to completely subclass an array in C...), and then life went in the > way. Anyhow. > I would advise you to check the very experimental git version of > scikits.timeseries on github, it can give you some ideas of what (not) to > do. Cool, I'll take a look. > > This part works fairly well now, except for some questions like what > should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe > "2011-02-28", or "2011-03-02"? > > Great example. What rules do you have for adding dates and deltas of > different frequency ? What's the frequency of the result? I'm following what I understand the NEP to mean for combining dates and deltas of different units. This means for timedeltas, the metadata becomes more precise, in particular it becomes the GCD of the input metadata, and between timedelta and datetime the datetime always dominates. https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst Only Years, Months, and Business Days have a nonlinear relationship with the other units, so they're the only problem case for this. They can be arbitrarily special-cased based on what is decided to make the most sense. As an aside, I don't actually like the name 'frequency' for the metadata, 'unit' sounds better to me. A frequency to me means a number with a particular form of unit, such as Hz or (1/s). > > > but parsing dates and times > > > > I've implemented an almost-ISO 8601 date-time parser. I had to deviate a > bit to support years outside the little 10000-year window we use. I think > supporting more formats could be handled by writing a function which takes > its date format and outputs ISO 8601 format to feed numpy. > > Good to have a nice parser independent of eGenix's, but I don't think it > should be the first priority. > Yeah, I wasn't planning on implementing such a parser at least in the near term. 
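For other formats, the conversion-function idea above can stay completely
outside numpy. A rough sketch using only the standard library (the name
to_iso8601 is just for illustration):

from datetime import datetime

def to_iso8601(datestring, fmt):
    # normalize any stdlib-parsable format to an ISO 8601 string
    return datetime.strptime(datestring, fmt).isoformat()

print to_iso8601("02/06/2011 17:30", "%d/%m/%Y %H:%M")   # 2011-06-02T17:30:00

Anything strptime (or a third-party parser) understands can be normalized that
way and then handed to the ISO 8601 parser.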
Other comments: > * I've never been really happy with the idea of putting a nb of events in > the dtype. I think it makes things more complicated than they already are, > for no big advantages (if you really need to keep track of several events > falling in the same unit of time, why don't you make several arrays ?). If > there's a will to change the API, I'd be glad to see this part dropped... > The NEP says that events were a requirement of a commercial sponsor, does anyone else have opinions about the feature? > * I quite agree with Mark W. It'd be great if there could be some basic > mechanism to handle holidays in numpy, for example using the metadata. It'd > be up to the user to decide what a holiday is (or holihour...). Wouldn't it > be too much overhead for most users, though ? And those holidays would > affect __add__/__sub__ and ? > In the current design, holidays would only be a feature of business days, nothing else. Sorry, no holihours. ;) I think the business days functionality might have to deviate from the general pattern quite a bit to work naturally for what it does. I imagine people will want to sometimes get a NaT, sometimes fall forward to the next valid business day, and sometimes fall backwards to the closest prior valid business day. > * As Chuck pointed out, users in different fields won't have the same > expectations. In environmental sciences, there's no need for leap seconds > nor business days. Astronomers and financial analysts have different > needs... Is there anyone watching this thread with specific needs for environmental sciences, astronomy, or other areas which might have an interest? -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 2 12:29:20 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 2 Jun 2011 11:29:20 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7B8C0.4040805@noaa.gov> References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 11:22, Christopher Barker wrote: > Charles R Harris wrote: >> * Leap seconds probably deserve a rigorous treatment, but having an >> internal representation with leap-seconds overcomplicates otherwise very >> simple and fast operations. > > could you explain more? I don't get the issues -- leap seconds would com > e in for calculations like: a_given_datetime + a_timedelta, correct? > Given leap years, and all the other ugliness, does leap seconds really > make it worse? Yes. Leap years are calculated by a fixed, reasonably efficient algorithm. Leap seconds are determined by committee every few years based on astronomical observations. We would need to keep a table of leap seconds up to date. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From mwwiebe at gmail.com Thu Jun 2 12:42:06 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 11:42:06 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7B8C0.4040805@noaa.gov> References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 11:22 AM, Christopher Barker wrote: > Charles R Harris wrote: > > Good support for units and delta times is very useful, but > > parsing dates and times and handling timezones, daylight savings, leap > > seconds, business days, etc., is probably best served by addon packages > > specialized to an area of interest. 
Just my $.02 > > I agree here -- I think for numpy, what's key is to focus on the kind of > things needed for computational use -- that is the performance critical > stuff. > > I suppose business-day type calculations would be both key and > performance-critical, but that sure seems like the kind of thing that > should go in an add-on package, rather than in numpy. > I agree anything specific to particular workweeks or holidays conventions should be external, but the ability to specify them and do calculations with them seems reasonable to me for numpy. > The stdlib datetime package is a little bit too small for my taste > (couldn't I at least as for a TimeDelta to be expressed in, say, > seconds, without doing any math on my own?), but the idea is good -- > create the core types, let add-on packages do the more specialized stuff. > > > * The existing datetime-related API is probably not useful, and in fact > > those functions aren't used internally anymore. Is it reasonable to > > remove the functions, or do we just deprecate them? > > I say remove, but some polling to see if anyone is using it might be in > order first. > > > * Leap seconds probably deserve a rigorous treatment, but having an > > internal representation with leap-seconds overcomplicates otherwise very > > simple and fast operations. > > could you explain more? I don't get the issues -- leap seconds would com > e in for calculations like: a_given_datetime + a_timedelta, correct? > Given leap years, and all the other ugliness, does leap seconds really > make it worse? > Leap years are easy compared with leap seconds. Leap seconds involve a hardcoded table of particular leap-seconds that are added or subtracted, and are specified roughly 6 months in advance of when they happen by the International Earth Rotation and Reference Systems Service (IERS). The POSIX time_t doesn't store leap seconds, so if you subtract two time_t values you may get the wrong answer by up to 34 seconds (or 24, I'm not totally clear on the initial 10 second jump that happened somewhere). > * Default conversion to string - should it be in UTC or with the local > > timezone baked in? > > most date_time handling should be time-zone neutral --i.e. assume > everything is in the same timezone (and daylight savings status). Libs > that assume you want the locale setting do nothing but cause major pain > if you have anything out of the ordinary to do (and sometimes ordinary > stuff, too). > The conversion to string would be in ISO 8601 format with the local timezone offset baked in, so it would still be an unambiguous UTC datetime. It would just look better to people wanting to see local time. If you MUST include time-zone, exlicite is better than implicit -- have > the user specify, or, at the very least make it easy for the user to > override any defaults. > The ISO 8601 timezone format is very explicit. It's an offset in minutes, so any daylight savings, etc is already baked in and no knowledge of timezones is required to work with it. > As UTC it may be confusing because 'now' will print > > as a different time than people would expect. > > I think "now" should be expressed (but also stored) in the local time, > unless the user asks for UTC. This is consistent with the std lib > datetime.now(), if nothing else. > I personally dislike this approach, and much prefer having datetime unambiguously representing a particular time instead of depending on context. I much prefer printing with the timezone baked in to show it as local. 
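To make that concrete, here is one instant rendered both ways -- a tiny
stdlib-only sketch, no numpy datetime involved:

from datetime import datetime, timedelta

utc = datetime(2011, 6, 2, 16, 30)          # 2011-06-02T16:30Z
local = utc + timedelta(hours=-7)           # what a reader at UTC-07:00 would see
print local.isoformat() + "-07:00"          # 2011-06-02T09:30:00-07:00

Both strings identify exactly the same point on the UTC timeline; the second is
just friendlier to a reader sitting in that timezone.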
> * Should the NaT (not-a-time) value behave like floating-point NaN? i.e. > > NaT == NaT return false, etc. Should operations generating NaT trigger > > an 'invalid' floating point exception in ufuncs? > > makes sense to me -- at least many folks are used to NaN symantics. > > > And after the removal of datetime from 1.4.1 and now this, I'd be in > > favor of putting a large "experimental" sticker over the whole thing > > until further notice. > > > > Do we have a good way to do that? > > Maybe a "experimental" warning, analogous to the "deprecation" warning. > > > Good support for units and delta times is very useful, > > > This part works fairly well now, except for some questions like what > > should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe > > "2011-02-28", or "2011-03-02"? > > Neither -- "month" should not be a valid unit to express a timedelta in. > Nor should year, or anything else that is not clearly defined (we can > argue about day, which does change a bit as the earth slows down, yes?) > Why not? If a datetime is in months, adding a timedelta in months makes perfect sense. It's just crossing between units which aren't linearly related which is a problem. > Yes, it's nice to be able to easily have a way of expressing things like > every month, or "a month from now" when you mean a calendar month, but > it's a heck of a can of worms. > So perhaps raising an exception in these cases is preferable to having a default behavior. We just had a big discussion about this in the netcdf CF metadata > standards list: > > http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/007807.html > > We more or less came to the conclusion (I did, anyway) that there were > two distinct, but related concepts: > > 1) time as a strict unit of measurement, like length, mass, etc. In that > case, don't use "months" as a unit. > > 2) Calendars -- these are what months, days of week, etc, etc, etc. are > from, and these get ugly. I also learned that there are even more > calendars than I thought. Beyond the Julian, Gregorian, etc, there are > special ones used for climate modeling and the like, that have nice > properties like all months being 30 days long, etc. Plus, as discussed, > various "business" calendars. > > So: I think that the calendar-related functions need fairly self > contained library, with various classes for the various calendars one > might want to use, and a well specified way to define new ones. > The NumPy datetime is based on 1), using the metadata unit and stored as an offset from 1970-01-01, modulo leap-seconds in a currently non-rigorous way. For 2), when parsing/printing it uses a Gregorian calendar (extending beyond when it's defined historically) from what I understand. Other calendar systems would be for further add-on libraries to handle, I believe. -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Thu Jun 2 12:57:52 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 02 Jun 2011 09:57:52 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: <4DE7C110.3010903@noaa.gov> Mark Wiebe wrote: > I'm following what I understand the NEP to mean for combining dates and > deltas of different units. This means for timedeltas, the metadata > becomes more precise, in particular it becomes the GCD of the input > metadata, and between timedelta and datetime the datetime always dominates. 
> > https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst Thanks for posting this link -- a few comments on that doc follow. > Only Years, Months, and Business Days have a nonlinear relationship with > the other units, so they're the only problem case for this. They can be > arbitrarily special-cased based on what is decided to make the most sense. As mentioned on my recent post -- this stuff should be handles by some sort of "calendar" classes -- there is no one way to do that! So numpy should provide datetime and timedelta data types that can be used, but a timedelta should _not_ ever be defined by these weird variable units. I guess what I'm getting is that: a_date_time + a_timedelta is a fundamentally different operation than: a_date_time + a_calendar_defined_timespan The former can follow all the usual math properties for addition, but the later doesn't. About the NEP: """ A representation is also supported such that the stored date-time integer can encode both the number of a particular unit as well as a number of sequential events tracked for each unit. """ I'm not sure I understand what this really means, but I _think_ I agree with Pierre that this is unnecessary complication - couldn't it be handled by multiple arrays, or maybe a structured dtype? """ The datetime64 represents an absolute time. Internally it is represented as the number of time units between the intended time and the epoch (12:00am on January 1, 1970 --- POSIX time including its lack of leap seconds). """ The CF netcdf metadata standard provides for times to be specified as "units since a_date_time". units can be seconds, hours, days, etc (it does allow months and years, but it shouldn't!). This is nice, flexible system that makes it easy to capture wildly different scales needed: from nanoseconds to millennia. Similarly, we might want to consider a datetime dtype as containing a reference datetime, and a tic unit. I think the "Time units" section does specify that you can use various units, but it looks like the NEP sticks with the single POSIX epoch. I see later in the NEP: """ However, after thinking more about this, we found that the combination of an absolute datetime64 with a relative timedelta64 does offer the same functionality while removing the need for the additional origin metadata. This is why we have removed it from this proposal. """ hmmm -- I don't think that's the case -- you need the "origin" if you want to represent something like nanoseconds as a datetime, far away from the epoch. Sure, you can supply your own by keeping the origin and a timedelta array separately, by you could do that for all uses, also, and the point of this is to make working with datetimes easy. If we're going to allow different units, we might as well have different "origins". I also don't think that units like "month", "year", "business day" should be allowed -- it just adds confusion. It's not a killer if they are defined in the spec: 1 year = 365.25 days (for instance0 1 month = 1year/12 But I think it's better to simply disallow them, and keep that use for what I'm calling the "Calendar" functions. And "business day" is particularly ugly, and, I'm sure defined differently in different places. -Chris -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From jsseabold at gmail.com Thu Jun 2 13:08:32 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Thu, 2 Jun 2011 13:08:32 -0400 Subject: [Numpy-discussion] bincount limitations In-Reply-To: <4DE6F0FA.4070805@gmail.com> References: <4DE6F0FA.4070805@gmail.com> Message-ID: On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac wrote: >> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller ?wrote: >>> Not quite. Bincount is fine if you have a set of approximately >>> sequential numbers. But if you don't.... > > > On 6/1/2011 9:35 PM, David Cournapeau wrote: >> Even worse, it fails miserably if you sequential numbers but with a high shift. >> np.bincount([100000001, 100000002]) # will take a lof of memory >> Doing bincount with dict is faster in those cases. > > > Since this discussion has turned shortcomings of bincount, > may I ask why np.bincount([]) is not an empty array? > Even more puzzling, why is np.bincount([],minlength=6) > not a 6-array of zeros? > Just looks like it wasn't coded that way, but it's low-hanging fruit. Any objections to adding this behavior? This commit should take care of it. Tests pass. Comments welcome, as I'm just getting my feet wet here. https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5 Skipper From mat.yeates at gmail.com Thu Jun 2 13:11:13 2011 From: mat.yeates at gmail.com (Mathew Yeates) Date: Thu, 2 Jun 2011 10:11:13 -0700 Subject: [Numpy-discussion] breaking array indices Message-ID: Hi I have indices into an array I'd like split so they are sequential e.g. [1,2,3,10,11] -> [1,2,3],[10,11] How do I do this? -Mathew From robert.kern at gmail.com Thu Jun 2 13:11:48 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 2 Jun 2011 12:11:48 -0500 Subject: [Numpy-discussion] bincount limitations In-Reply-To: References: <4DE6F0FA.4070805@gmail.com> Message-ID: On Thu, Jun 2, 2011 at 12:08, Skipper Seabold wrote: > On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac wrote: >>> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller ?wrote: >>>> Not quite. Bincount is fine if you have a set of approximately >>>> sequential numbers. But if you don't.... >> >> >> On 6/1/2011 9:35 PM, David Cournapeau wrote: >>> Even worse, it fails miserably if you sequential numbers but with a high shift. >>> np.bincount([100000001, 100000002]) # will take a lof of memory >>> Doing bincount with dict is faster in those cases. >> >> >> Since this discussion has turned shortcomings of bincount, >> may I ask why np.bincount([]) is not an empty array? >> Even more puzzling, why is np.bincount([],minlength=6) >> not a 6-array of zeros? >> > > Just looks like it wasn't coded that way, but it's low-hanging fruit. > Any objections to adding this behavior? This commit should take care > of it. Tests pass. Comments welcome, as I'm just getting my feet wet > here. > > https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5 I would use np.zeros(5, dtype=int) in test_empty_with_minlength(), but otherwise, it looks good. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? 
-- Umberto Eco From mwwiebe at gmail.com Thu Jun 2 13:15:10 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 12:15:10 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7C110.3010903@noaa.gov> References: <4DE7C110.3010903@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 11:57 AM, Christopher Barker wrote: > Mark Wiebe wrote: > > I'm following what I understand the NEP to mean for combining dates and > > deltas of different units. This means for timedeltas, the metadata > > becomes more precise, in particular it becomes the GCD of the input > > metadata, and between timedelta and datetime the datetime always > dominates. > > > > > https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst > > Thanks for posting this link -- a few comments on that doc follow. > > > Only Years, Months, and Business Days have a nonlinear relationship with > > the other units, so they're the only problem case for this. They can be > > arbitrarily special-cased based on what is decided to make the most > sense. > > As mentioned on my recent post -- this stuff should be handles by some > sort of "calendar" classes -- there is no one way to do that! So numpy > should provide datetime and timedelta data types that can be used, but a > timedelta should _not_ ever be defined by these weird variable units. > > I guess what I'm getting is that: > > a_date_time + a_timedelta > > is a fundamentally different operation than: > > a_date_time + a_calendar_defined_timespan > > The former can follow all the usual math properties for addition, but > the later doesn't. > It is possible to implement the system so that if you don't use Y/M/B, things work out unambiguously, but if you do use them you get a behavior that's a little weird, but with rules to eliminate the calendar-created ambiguities. For the business day unit, what I'm currently trying to do is get an assessment of whether my proposed design the right abstraction to support all the use cases of people who want it. About the NEP: > > """ > A representation is also supported such that the stored date-time > integer can encode both the number of a particular unit as well as a > number of sequential events tracked for each unit. > """ > > I'm not sure I understand what this really means, but I _think_ I agree > with Pierre that this is unnecessary complication - couldn't it be > handled by multiple arrays, or maybe a structured dtype? > I think it depends on its use cases, which unfortunately aren't actually described in the NEP. > """ > The datetime64 represents an absolute time. Internally it is represented > as the number of time units between the intended time and the epoch > (12:00am on January 1, 1970 --- POSIX time including its lack of leap > seconds). > """ > > The CF netcdf metadata standard provides for times to be specified as > "units since a_date_time". units can be seconds, hours, days, etc (it > does allow months and years, but it shouldn't!). This is nice, flexible > system that makes it easy to capture wildly different scales needed: > from nanoseconds to millennia. Similarly, we might want to consider a > datetime dtype as containing a reference datetime, and a tic unit. > > I think the "Time units" section does specify that you can use various > units, but it looks like the NEP sticks with the single POSIX epoch. 
> > I see later in the NEP: > """ > However, after thinking more about this, we found that the combination > of an absolute datetime64 with a relative timedelta64 does offer the > same functionality while removing the need for the additional origin > metadata. This is why we have removed it from this proposal. > """ > hmmm -- I don't think that's the case -- you need the "origin" if you > want to represent something like nanoseconds as a datetime, far away > from the epoch. Sure, you can supply your own by keeping the origin and > a timedelta array separately, by you could do that for all uses, also, > and the point of this is to make working with datetimes easy. If we're > going to allow different units, we might as well have different "origins". > I rather agree here, adding the 'origin' back in is definitely worth considering. How is the origin represented in the CF netcdf code? > I also don't think that units like "month", "year", "business day" > should be allowed -- it just adds confusion. It's not a killer if they > are defined in the spec: > > 1 year = 365.25 days (for instance0 > 1 month = 1year/12 > > But I think it's better to simply disallow them, and keep that use for > what I'm calling the "Calendar" functions. And "business day" is > particularly ugly, and, I'm sure defined differently in different places. So using the calendar specified by ISO 8601 as the default for the calendar-based functions is undesirable? I think supporting it to a small extent is reasonable, and support for any other calendars or more advanced calendar-based functions would go in support libraries. Having something calendar-like is already implied by calling the type "datetime". instead of just "timepoint" or something like that. -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Thu Jun 2 13:17:17 2011 From: ben.root at ou.edu (Benjamin Root) Date: Thu, 2 Jun 2011 12:17:17 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 11:42 AM, Mark Wiebe wrote: > On Thu, Jun 2, 2011 at 11:22 AM, Christopher Barker > wrote: > >> Charles R Harris wrote: >> > Good support for units and delta times is very useful, but >> > parsing dates and times and handling timezones, daylight savings, leap >> > seconds, business days, etc., is probably best served by addon packages >> > specialized to an area of interest. Just my $.02 >> >> I agree here -- I think for numpy, what's key is to focus on the kind of >> things needed for computational use -- that is the performance critical >> stuff. >> >> I suppose business-day type calculations would be both key and >> performance-critical, but that sure seems like the kind of thing that >> should go in an add-on package, rather than in numpy. >> > > I agree anything specific to particular workweeks or holidays conventions > should be external, but the ability to specify them and do calculations with > them seems reasonable to me for numpy. > > >> The stdlib datetime package is a little bit too small for my taste >> (couldn't I at least as for a TimeDelta to be expressed in, say, >> seconds, without doing any math on my own?), but the idea is good -- >> create the core types, let add-on packages do the more specialized stuff. >> >> > * The existing datetime-related API is probably not useful, and in fact >> > those functions aren't used internally anymore. Is it reasonable to >> > remove the functions, or do we just deprecate them? 
>> >> I say remove, but some polling to see if anyone is using it might be in >> order first. >> >> > * Leap seconds probably deserve a rigorous treatment, but having an >> > internal representation with leap-seconds overcomplicates otherwise very >> > simple and fast operations. >> >> could you explain more? I don't get the issues -- leap seconds would com >> e in for calculations like: a_given_datetime + a_timedelta, correct? >> Given leap years, and all the other ugliness, does leap seconds really >> make it worse? >> > > Leap years are easy compared with leap seconds. Leap seconds involve a > hardcoded table of particular leap-seconds that are added or subtracted, and > are specified roughly 6 months in advance of when they happen by the International > Earth Rotation and Reference Systems Service > (IERS). The POSIX time_t doesn't store leap seconds, so if you subtract > two time_t values you may get the wrong answer by up to 34 seconds (or 24, > I'm not totally clear on the initial 10 second jump that happened > somewhere). > > > * Default conversion to string - should it be in UTC or with the local >> > timezone baked in? >> >> most date_time handling should be time-zone neutral --i.e. assume >> everything is in the same timezone (and daylight savings status). Libs >> that assume you want the locale setting do nothing but cause major pain >> if you have anything out of the ordinary to do (and sometimes ordinary >> stuff, too). >> > > The conversion to string would be in ISO 8601 format with the local > timezone offset baked in, so it would still be an unambiguous UTC datetime. > It would just look better to people wanting to see local time. > > If you MUST include time-zone, exlicite is better than implicit -- have >> the user specify, or, at the very least make it easy for the user to >> override any defaults. >> > > The ISO 8601 timezone format is very explicit. It's an offset in minutes, > so any daylight savings, etc is already baked in and no knowledge of > timezones is required to work with it. > > > As UTC it may be confusing because 'now' will print >> > as a different time than people would expect. >> >> I think "now" should be expressed (but also stored) in the local time, >> unless the user asks for UTC. This is consistent with the std lib >> datetime.now(), if nothing else. >> > > I personally dislike this approach, and much prefer having datetime > unambiguously representing a particular time instead of depending on > context. I much prefer printing with the timezone baked in to show it as > local. > > > * Should the NaT (not-a-time) value behave like floating-point NaN? i.e. >> > NaT == NaT return false, etc. Should operations generating NaT trigger >> > an 'invalid' floating point exception in ufuncs? >> >> makes sense to me -- at least many folks are used to NaN symantics. >> >> > And after the removal of datetime from 1.4.1 and now this, I'd be in >> > favor of putting a large "experimental" sticker over the whole thing >> > until further notice. >> > >> > Do we have a good way to do that? >> >> Maybe a "experimental" warning, analogous to the "deprecation" warning. >> >> > Good support for units and delta times is very useful, >> >> > This part works fairly well now, except for some questions like what >> > should datetime("2011-01-30", "D") + timedelta(1, "M") produce. Maybe >> > "2011-02-28", or "2011-03-02"? >> >> Neither -- "month" should not be a valid unit to express a timedelta in. 
>> Nor should year, or anything else that is not clearly defined (we can >> argue about day, which does change a bit as the earth slows down, yes?) >> > > Why not? If a datetime is in months, adding a timedelta in months makes > perfect sense. It's just crossing between units which aren't linearly > related which is a problem. > > >> Yes, it's nice to be able to easily have a way of expressing things like >> every month, or "a month from now" when you mean a calendar month, but >> it's a heck of a can of worms. >> > > So perhaps raising an exception in these cases is preferable to having a > default behavior. > > We just had a big discussion about this in the netcdf CF metadata >> standards list: >> >> http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/007807.html >> >> We more or less came to the conclusion (I did, anyway) that there were >> two distinct, but related concepts: >> >> 1) time as a strict unit of measurement, like length, mass, etc. In that >> case, don't use "months" as a unit. >> >> 2) Calendars -- these are what months, days of week, etc, etc, etc. are >> from, and these get ugly. I also learned that there are even more >> calendars than I thought. Beyond the Julian, Gregorian, etc, there are >> special ones used for climate modeling and the like, that have nice >> properties like all months being 30 days long, etc. Plus, as discussed, >> various "business" calendars. >> >> So: I think that the calendar-related functions need fairly self >> contained library, with various classes for the various calendars one >> might want to use, and a well specified way to define new ones. >> > > The NumPy datetime is based on 1), using the metadata unit and stored as an > offset from 1970-01-01, modulo leap-seconds in a currently non-rigorous way. > For 2), when parsing/printing it uses a Gregorian calendar (extending beyond > when it's defined historically) from what I understand. Other calendar > systems would be for further add-on libraries to handle, I believe. > > -Mark > > Just throwing it out there. Back in my Perl days (I do not often admit this), the one module that I thought was really well done was the datetime module. Now, I am not saying that we would need to support the calendar from Lord of the Rings or anything like that, but its floating timezone concept and other features were quite useful. Maybe there are lessons to be learned from there? Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Thu Jun 2 13:27:49 2011 From: shish at keba.be (Olivier Delalleau) Date: Thu, 2 Jun 2011 13:27:49 -0400 Subject: [Numpy-discussion] breaking array indices In-Reply-To: References: Message-ID: I think this does what you want: def seq_split(x): r = [0] + list(numpy.where(x[1:] != x[:-1] + 1)[0] + 1) + [None] return [x[r[i]:r[i + 1]] for i in xrange(len(r) - 1)] -=- Olivier 2011/6/2 Mathew Yeates > Hi > I have indices into an array I'd like split so they are sequential > e.g. > [1,2,3,10,11] -> [1,2,3],[10,11] > > How do I do this? > > -Mathew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jsseabold at gmail.com Thu Jun 2 13:39:08 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Thu, 2 Jun 2011 13:39:08 -0400 Subject: [Numpy-discussion] bincount limitations In-Reply-To: References: <4DE6F0FA.4070805@gmail.com> Message-ID: On Thu, Jun 2, 2011 at 1:11 PM, Robert Kern wrote: > On Thu, Jun 2, 2011 at 12:08, Skipper Seabold wrote: >> On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac wrote: >>>> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller ?wrote: >>>>> Not quite. Bincount is fine if you have a set of approximately >>>>> sequential numbers. But if you don't.... >>> >>> >>> On 6/1/2011 9:35 PM, David Cournapeau wrote: >>>> Even worse, it fails miserably if you sequential numbers but with a high shift. >>>> np.bincount([100000001, 100000002]) # will take a lof of memory >>>> Doing bincount with dict is faster in those cases. >>> >>> >>> Since this discussion has turned shortcomings of bincount, >>> may I ask why np.bincount([]) is not an empty array? >>> Even more puzzling, why is np.bincount([],minlength=6) >>> not a 6-array of zeros? >>> >> >> Just looks like it wasn't coded that way, but it's low-hanging fruit. >> Any objections to adding this behavior? This commit should take care >> of it. Tests pass. Comments welcome, as I'm just getting my feet wet >> here. >> >> https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5 > > I would use np.zeros(5, dtype=int) in test_empty_with_minlength(), but > otherwise, it looks good. > Ok, thanks. Made the change and removed the old test that it fails on empty. Pull request. https://github.com/numpy/numpy/pull/84 Skipper From mwwiebe at gmail.com Thu Jun 2 13:46:18 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 12:46:18 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 12:17 PM, Benjamin Root wrote: > > Just throwing it out there. Back in my Perl days (I do not often admit > this), the one module that I thought was really well done was the datetime > module. Now, I am not saying that we would need to support the calendar > from Lord of the Rings or anything like that, but its floating timezone > concept and other features were quite useful. Maybe there are lessons to be > learned from there? > > The perl stuff does look very extensive. Here's a sample link: http://www.perl.com/pub/2003/03/13/datetime.html This maybe clarifies a little something I wrote before, where I suggested the NumPy datetime is always UTC (or another unambiguous representation), and repr/print use local time as the display format, which is still unambiguous UTC but more human-readable. They give an example where adding 6 months to a local time can cross a daylight savings boundary, and having 'now' give back a local time instead of UTC would imply that there is some kind of functionality like this in there, very confusing. Better to leave local time as just a display format. -Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Thu Jun 2 14:07:33 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 2 Jun 2011 12:07:33 -0600 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7C110.3010903@noaa.gov> References: <4DE7C110.3010903@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 10:57 AM, Christopher Barker wrote: > Mark Wiebe wrote: > > I'm following what I understand the NEP to mean for combining dates and > > deltas of different units. This means for timedeltas, the metadata > > becomes more precise, in particular it becomes the GCD of the input > > metadata, and between timedelta and datetime the datetime always > dominates. > > > > > https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst > > Thanks for posting this link -- a few comments on that doc follow. > > > Only Years, Months, and Business Days have a nonlinear relationship with > > the other units, so they're the only problem case for this. They can be > > arbitrarily special-cased based on what is decided to make the most > sense. > > As mentioned on my recent post -- this stuff should be handles by some > sort of "calendar" classes -- there is no one way to do that! So numpy > should provide datetime and timedelta data types that can be used, but a > timedelta should _not_ ever be defined by these weird variable units. > > I guess what I'm getting is that: > > a_date_time + a_timedelta > > is a fundamentally different operation than: > > a_date_time + a_calendar_defined_timespan > > The former can follow all the usual math properties for addition, but > the later doesn't. > > About the NEP: > > """ > A representation is also supported such that the stored date-time > integer can encode both the number of a particular unit as well as a > number of sequential events tracked for each unit. > """ > > I'm not sure I understand what this really means, but I _think_ I agree > with Pierre that this is unnecessary complication - couldn't it be > handled by multiple arrays, or maybe a structured dtype? > > """ > The datetime64 represents an absolute time. Internally it is represented > as the number of time units between the intended time and the epoch > (12:00am on January 1, 1970 --- POSIX time including its lack of leap > seconds). > """ > > The CF netcdf metadata standard provides for times to be specified as > "units since a_date_time". units can be seconds, hours, days, etc (it > does allow months and years, but it shouldn't!). This is nice, flexible > system that makes it easy to capture wildly different scales needed: > from nanoseconds to millennia. Similarly, we might want to consider a > datetime dtype as containing a reference datetime, and a tic unit. > > I think the "Time units" section does specify that you can use various > units, but it looks like the NEP sticks with the single POSIX epoch. > > I see later in the NEP: > """ > However, after thinking more about this, we found that the combination > of an absolute datetime64 with a relative timedelta64 does offer the > same functionality while removing the need for the additional origin > metadata. This is why we have removed it from this proposal. > """ > hmmm -- I don't think that's the case -- you need the "origin" if you > want to represent something like nanoseconds as a datetime, far away > from the epoch. Sure, you can supply your own by keeping the origin and > a timedelta array separately, by you could do that for all uses, also, > and the point of this is to make working with datetimes easy. 
If we're > going to allow different units, we might as well have different "origins". > > +1 > > I also don't think that units like "month", "year", "business day" > should be allowed -- it just adds confusion. It's not a killer if they > are defined in the spec: > > 1 year = 365.25 days (for instance) > 1 month = 1 year/12 > > But I think it's better to simply disallow them, and keep that use for > what I'm calling the "Calendar" functions. And "business day" is > particularly ugly, and, I'm sure defined differently in different places. > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Thu Jun 2 14:09:16 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 2 Jun 2011 20:09:16 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7C110.3010903@noaa.gov> References: <4DE7C110.3010903@noaa.gov> Message-ID: <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> On Jun 2, 2011, at 6:57 PM, Christopher Barker wrote: > Mark Wiebe wrote: >> I'm following what I understand the NEP to mean for combining dates and >> deltas of different units. This means for timedeltas, the metadata >> becomes more precise, in particular it becomes the GCD of the input >> metadata, and between timedelta and datetime the datetime always dominates. >> >> https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst > > Thanks for posting this link -- a few comments on that doc follow. > >> Only Years, Months, and Business Days have a nonlinear relationship with >> the other units, so they're the only problem case for this. They can be >> arbitrarily special-cased based on what is decided to make the most sense. > > As mentioned on my recent post -- this stuff should be handled by some > sort of "calendar" classes -- there is no one way to do that! So numpy > should provide datetime and timedelta data types that can be used, but a > timedelta should _not_ ever be defined by these weird variable units. > About the NEP: > > """ > A representation is also supported such that the stored date-time > integer can encode both the number of a particular unit as well as a > number of sequential events tracked for each unit. > """ > > I'm not sure I understand what this really means, but I _think_ I agree > with Pierre that this is unnecessary complication - couldn't it be > handled by multiple arrays, or maybe a structured dtype? My point. Mark W., perhaps you should ask the folks at Enthought (Travis O. in particular) what they had in mind ? Whether the original interest is still there. If it is, I wonder if we can't find some workaround that would drop support for that. > If we're > going to allow different units, we might as well have different "origins". +1 > > I also don't think that units like "month", "year", "business day" > should be allowed -- it just adds confusion. It's not a killer if they > are defined in the spec: -1 In scikits.timeseries, not only did we have years, months and business days, but we also had weeks, that proved quite useful sometimes, and that were discarded in the discussion of the new NEP a few years back. Anyhow, years and months are simple enough. In case of ambiguities like Mark's example, we can always raise an exception. ISO8601 seems quite OK. Support for multiple time zones, daylight saving conventions, holidays and so forth could be left to specific packages.
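As a rough sketch (plain Python, not the proposed dtype) of the "raise an exception on ambiguity" policy Pierre suggests for calendar units: adding N months can be exact whenever the day of the month exists in the target month, and an error otherwise:

    import calendar
    import datetime

    def add_months_strict(d, n):
        # months are a calendar operation: exact when the target day exists,
        # an explicit error when it does not (e.g. Jan 31 + 1 month)
        y, m = divmod(d.month - 1 + n, 12)
        year, month = d.year + y, m + 1
        if d.day > calendar.monthrange(year, month)[1]:
            raise ValueError("%s + %d months is ambiguous" % (d, n))
        return d.replace(year=year, month=month)

    print(add_months_strict(datetime.date(2011, 1, 15), 1))  # 2011-02-15
    # add_months_strict(datetime.date(2011, 1, 31), 1) raises ValueError

Whether to raise, clamp to the end of the month, or roll over is exactly the kind of policy a calendar layer would have to pin down.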
From mat.yeates at gmail.com Thu Jun 2 14:16:08 2011 From: mat.yeates at gmail.com (Mathew Yeates) Date: Thu, 2 Jun 2011 11:16:08 -0700 Subject: [Numpy-discussion] breaking array indices In-Reply-To: References: Message-ID: thanks. My solution was much hackier. -Mathew On Thu, Jun 2, 2011 at 10:27 AM, Olivier Delalleau wrote: > I think this does what you want: > > def seq_split(x): >     r = [0] + list(numpy.where(x[1:] != x[:-1] + 1)[0] + 1) + [None] >     return [x[r[i]:r[i + 1]] for i in xrange(len(r) - 1)] > > -=- Olivier > > 2011/6/2 Mathew Yeates >> >> Hi >> I have indices into an array I'd like split so they are sequential >> e.g. >> [1,2,3,10,11] -> [1,2,3],[10,11] >> >> How do I do this? >> >> -Mathew >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From Chris.Barker at noaa.gov Thu Jun 2 14:24:43 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 02 Jun 2011 11:24:43 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: <4DE7D56B.3010701@noaa.gov> Benjamin Root wrote: > Just throwing it out there. Back in my Perl days (I do not often admit > this), the one module that I thought was really well done was the > datetime module. Now, I am not saying that we would need to support the > calendar from Lord of the Rings or anything like that, but its floating > timezone concept and other features were quite useful. Maybe there are > lessons to be learned from there? Another one to look at is the classes in the Earth System Modeling Framework (ESMF). That will give you a good idea what the atmospheric/oceanographic/climate modelers need, and how at least one well-respected group has addressed these issues. http://www.earthsystemmodeling.org/esmf_releases/non_public/ESMF_5_1_0/ESMC_crefdoc/node6.html (If that long URL breaks, I found that by googling: "ESMF calendar date time") -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From Chris.Barker at noaa.gov Thu Jun 2 14:44:07 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 02 Jun 2011 11:44:07 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> References: <4DE7C110.3010903@noaa.gov> <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> Message-ID: <4DE7D9F7.70708@noaa.gov> Pierre GM wrote: >> I also don't think that units like "month", "year", "business day" >> should be allowed -- it just adds confusion. It's not a killer if they >> are defined in the spec: > > -1 In scikits.timeseries, not only did we have years, months and > business days, but we also had weeks, that proved quite useful sometimes, Actually, I have little issue with weeks -- it can be clearly defined as 7 * 24 hours. The problem with months is that they are not all the same length. One of the issues pointed out in the discussion on the CF list was that for the most part, scientists expect that a time series has nice properties, like differentiability, etc. That all goes to heck if you have months as a unit.
Unless a month is clearly defined as a particular number of hours, and ONLY means that, but then some folks may not get what they expect. > Anyhow, years and months are simple enough. no, they are not -- they are fundamentally different than hours, days, etc. > In case of ambiguities like Mark's example, we can always raise an exception. sure -- but what I'm suggesting is that rather than have one thing (a datetime array) that behaves differently according to what units it happens to use, we clearly separate what I'm calling Calendar functionality for basic time functionality. What that means to me is that a TimeDelta is ALWAYS an unambiguous time unit, and a datetime is ALWAYS expressed as an unambiguous time unit since a well-defined epoch. If you want something like "6 months after a datetime", or "14 business days after a datetime" that should be handles by a calendar class of some sort. (hmm, maybe I'm being too pedantic here -- the Calendar class could be associated with the datetime and timedelta objects, I suppose) so a given timedelta "knows" how to do math with itself. But I think this is fraught with problems: one of the keys to usability ond performance of the datetiem arrays is that they are really arrays of integers, and you can do math with them just as though they are integers. That will break if you allow these non-linear time steps. So, thinking out loud here, maybe: datetime timedelta calendartimedelta to make it very clear what one is working with? as for "calendardatetime", at first blush, I don't think there is a need. Do you need to express a datetime as "n months from the epoch", rather than "12:00am on July 1, 1970" -- which could be represented in hours, minutes, seconds, etc? What I'm getting at is that the difference between calendars is in what timedeltas mean, not what a unit in time is. > ISO8601 seems quite OK. All that does is specify a string representation, no? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Thu Jun 2 14:44:39 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 13:44:39 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> References: <4DE7C110.3010903@noaa.gov> <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> Message-ID: On Thu, Jun 2, 2011 at 1:09 PM, Pierre GM wrote: > > > > """ > > A representation is also supported such that the stored date-time > > integer can encode both the number of a particular unit as well as a > > number of sequential events tracked for each unit. > > """ > > > > I'm not sure I understand what this really means, but I _think_ I agree > > with Pierre that this is unnecessary complication - couldn't it be > > handled by multiple arrays, or maybe a structured dtype? > > My point. Mark W., perhaps you should ask the folks at Enthought (Travis O. > in particular) what they had in mind ? Whether the original interest is > still there. If it is, I wonder we can't find some workaround that would > drop support for that. I'll ask, for sure. > > If we're > > going to allow different units, we might as well have different > "origins". > > +1 > > > > > I also don't think that units like "month", "year", "business day" > > should be allowed -- it just adds confusion. 
It's not a killer if they > > are defined in the spec: > > -1 > In scikits.timeseries, not only did we have years,months and business days, > but we also had weeks, that proved quite useful sometimes, and that were > discarded in the discussion of the new NEP a few years back. > There are weeks in the implementation. One of my points in the original email was whether to adjust week's epoch by a few days so they align with ISO8601 weeks, do you have an opinion about that? -Mark > Anyhow, years and months are simple enough. In case of ambiguities like > Mark's example, we can always raise an exception. ISO8601 seems quite OK. > Support for multiple time zones, daylight saving conventions, holidays and > so forth could be let to specific packages. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 2 14:56:56 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 2 Jun 2011 12:56:56 -0600 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7D9F7.70708@noaa.gov> References: <4DE7C110.3010903@noaa.gov> <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> <4DE7D9F7.70708@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 12:44 PM, Christopher Barker wrote: > Pierre GM wrote: > >> I also don't think that units like "month", "year", "business day" > >> should be allowed -- it just adds confusion. It's not a killer if they > >> are defined in the spec: > > > > -1 In scikits.timeseries, not only did we have years,months and > > business days, but we also had weeks, that proved quite useful sometimes, > > Actually, I have little issue with weeks -- it can be clearly defined as > 7 * 24 hours. The problem with months is that they are not all the same > length. > > One of the issues pointed out in the discussion on the CF list was that > for the most part, scientist expect that a time series has nice > properties, like differentiability, etc. That all goes to heck if you > have months as a unit. Unless a month is clearly defined as a particular > number of hours, and ONLY means that, but then some folks may not get > what they expect. > > > Anyhow, years and months are simple enough. > > no, they are not -- they are fundamentally different than hours, days, etc. > > > In case of ambiguities like Mark's example, we can always raise an > exception. > > sure -- but what I'm suggesting is that rather than have one thing (a > datetime array) that behaves differently according to what units it > happens to use, we clearly separate what I'm calling Calendar > functionality for basic time functionality. What that means to me is > that a TimeDelta is ALWAYS an unambiguous time unit, and a datetime is > ALWAYS expressed as an unambiguous time unit since a well-defined epoch. > > If you want something like "6 months after a datetime", or "14 business > days after a datetime" that should be handles by a calendar class of > some sort. > > (hmm, maybe I'm being too pedantic here -- the Calendar class could be > associated with the datetime and timedelta objects, I suppose) so a > given timedelta "knows" how to do math with itself. But I think this is > fraught with problems: > > one of the keys to usability ond performance of the datetiem arrays is > that they are really arrays of integers, and you can do math with them > just as though they are integers. 
That will break if you allow these > non-linear time steps. > > So, thinking out loud here, maybe: > > datetime > timedelta > calendartimedelta > > to make it very clear what one is working with? > > as for "calendardatetime", at first blush, I don't think there is a > need. Do you need to express a datetime as "n months from the epoch", > rather than "12:00am on July 1, 1970" -- which could be represented in > hours, minutes, seconds, etc? > > What I'm getting at is that the difference between calendars is in what > timedeltas mean, not what a unit in time is. > > > I agree with the points that Chris is raising. And I would stick UTC in the calender category on account of the leap seconds. That's not to say that numpy shouldn't offer at least one 'calender' class, but that the computational core needs to be kept flexible and not tie itself to something like UTC that will in some circumstances require translation to linear units. How to make that separation efficient is another question that perhaps touches on labeled arrays, or arrays with units, or something in that general category. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Thu Jun 2 15:07:26 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Thu, 2 Jun 2011 21:07:26 +0200 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 5:45 PM, martin smith wrote: > I have a bit of code that performs multi-taper power spectra using numpy > and a C extension module. The C portion consists of an interface file > and a python-unaware computational file. The latter invokes fftpack. > > The straightforward setup.py appended below works fine on Linux. On > Windows using MinGW it's not so good. I first created a libpython27.a > file following instructions on the web. The C code now compiles but > fails to link since there is no fftpack_lite.so on Windows. I can't > find an fftpack_lite.dll (or even the source for fftpack_lite) and don't > know where to turn. > Python dll's on Windows have a .pyd extension, there should be a fftpack_lite.pyd file. But you should be able to just leave off the extension, so it does the right thing on any platform. Also, the way you specified library_dirs doesn't work if numpy is not in site-packages, better to use something like: library_dirs = [os.path.join(numpy.__path__[0], 'fft')] > Trying to sort these problems out in Windows is (to me) difficult and > bewildering. I'd much appreciate any advice. > > - martin smith > > (BTW: the code is unencumbered and available for the asking.) > > After search for multi-taper I found http://projects.scipy.org/scipy/ticket/608. Too bad no one has gotten around to review your contribution before, it sounds like a good addition for the scipy.signal module. 
Cheers, Ralf > > ---------------------------------------------------------------------- > from numpy.distutils.core import setup > from numpy.distutils.extension import Extension > import os.path > import distutils.sysconfig as sf > > use_fftpack = 1 > > if use_fftpack: > > pymutt_module = Extension( > 'pymutt', > ['pymutt.c'], > library_dirs = [os.path.join(sf.get_python_lib(), 'numpy', 'fft')], > define_macros = [('USE_FFTPACK', None)], > libraries = [':fftpack_lite.so'], > ) > > else: > > pymutt_module = Extension( > 'pymutt', > ['pymutt.c'], > libraries = ['fftw3'], > ) > > setup(name='pymutt', > version='1.0', > description = "numpy support for multi-taper fourier transforms", > author = "Martin L. Smith", > long_description = > ''' > Implements a numpy interface to an implementation of Thomson's > multi-taper fourier transform algorithms. The key C module (mtbase.c) > is derived from code written and made freely available by J. M. Lees and > Jeff Park. > ''', > ext_modules=[pymutt_module], > ) > > ---------------------------------------------------------------------------------------- > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 2 15:33:47 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 2 Jun 2011 14:33:47 -0500 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 14:07, Ralf Gommers wrote: > > On Wed, Jun 1, 2011 at 5:45 PM, martin smith wrote: >> >> I have a bit of code that performs multi-taper power spectra using numpy >> and a C extension module. ?The C portion consists of an interface file >> and a python-unaware computational file. ?The latter invokes fftpack. >> >> The straightforward setup.py appended below works fine on Linux. ?On >> Windows using MinGW it's not so good. I first created a libpython27.a >> file following instructions on the web. ?The C code now compiles but >> fails to link since there is no fftpack_lite.so on Windows. ?I can't >> find an fftpack_lite.dll (or even the source for fftpack_lite) and don't >> know where to turn. > > Python dll's on Windows have a .pyd extension, there should be a > fftpack_lite.pyd file. But you should be able to just leave off the > extension, so it does the right thing on any platform. > > Also, the way you specified library_dirs doesn't work if numpy is not in > site-packages, better to use something like: > ???? library_dirs = [os.path.join(numpy.__path__[0], 'fft')] That said, there is no good cross-platform way to link against other Python extension modules. Please do not try. You will have to include a copy of the FFTPACK code in your own extension module. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? 
-- Umberto Eco From Chris.Barker at noaa.gov Thu Jun 2 15:45:37 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 02 Jun 2011 12:45:37 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7C110.3010903@noaa.gov> Message-ID: <4DE7E861.70007@noaa.gov> Mark Wiebe wrote: > It is possible to implement the system so that if you don't use Y/M/B, > things work out unambiguously, but if you do use them you get a behavior > that's a little weird, but with rules to eliminate the calendar-created > ambiguities. yes, but everyone wants different rules -- so it needs to be very clear which rules are in place, and there needs to be a way for a user to specify his/her own rules. > For the business day unit, what I'm currently trying to do > is get an assessment of whether my proposed design the right abstraction > to support all the use cases of people who want it. Good plan. > I rather agree here, adding the 'origin' back in is definitely worth > considering. How is the origin represented in the CF netcdf code? As a ISO 1601 string. I can't recall if you have the option of specifying a non-standard calendar. > So using the calendar specified by ISO 8601 as the default for the > calendar-based functions is undesirable? no -- that's fine -- but does ISO 8601 specify stuff like business day? > I think supporting it to a > small extent is reasonable, and support for any other calendars or more > advanced calendar-based functions would go in support libraries. yup -- what I'm trying to press here is the distinction between linear time units and the "weird" concepts, like business day, month, etc. I think there are two related, but distinct issues: 1) representation/specification of a "datetime". The idea here is that imagine that there is a continuous property called time (which I suppose has a zero at the Big Bang). We need a way to define where (when) in that continuum a given event, or set of events occurred. This is what the datetime dtype is about. I think the standard of "some-time-unit since some-reference-datetime, in some-calendar" is fine, but that the time-unit should be unambiguously and clearly defined, and not change with when it occurs, i.e. seconds, hours, days, but not months, years, or business days. 2) time spans, and math with time: i.e. timedeltas --- this falls into 2 categories: a) simple linear time units: seconds, hours, etc. This is quite straightforward, if working with other time deltas and datetimes all expressed in well-defined linear units. b) calendar manipulations: "months since", "business days since", once a month, "the first sunday of teh month", "next monday". These require a well defined and complex Calendar, and there are many possible such Calendars. What I'm suggesting is that (a) and (b) should be kept quite distinct, and that it should be fairly easy to define and use custom Calendars defined for (b). (a) and (b) could be merged, with various defaults and exceptions raised for poorly defined operations, but I think that'll be less clear, harder to implement, and more prone to error. A little example, again from the CF mailing list (which spawned the discussion). In the CF standard the units available are defined as "those supported by the udunits library": http://www.unidata.ucar.edu/software/udunits/ It turns out that udunits only supports time manipulation as I specified as (a) i.e. only clearly defined linear time units. 
However, they do define "months" and "years", as specific values (something like 365.25 days/year and 12 months/year -- though they also have "Julian-year", "leap_year", etc) So folks would specify a time axes as : "months since 2010-01" and expect that they were getting calandar months, like "1" would mean Feb, 2010, instaed of January 31, 2010 (or whatever). Anyway, lots of room for confusion, so whatever we come up with needs to be clearly defined. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From robert.kern at gmail.com Thu Jun 2 15:57:37 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 2 Jun 2011 14:57:37 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DE7D9F7.70708@noaa.gov> References: <4DE7C110.3010903@noaa.gov> <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> <4DE7D9F7.70708@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 13:44, Christopher Barker wrote: > Pierre GM wrote: >> Anyhow, years and months are simple enough. > > no, they are not -- they are fundamentally different than hours, days, etc. That doesn't make them *difficult*. It's tricky to convert between months and hours, yes, but that's not the only operation that we're looking to represent. A *ton* of important time series are in these calendrical units (monthly, semi-monthly, quarterly, yearly, etc.). If you collect economic data on a monthly basis, you don't need to unambiguously convert "January 2011" to a microsecond timestamp. You do need to be able to "add 6 months", "add 1 year", and "roll up each 3 month period into quarters", etc. Each of those operations *is* unambiguous. The data simply doesn't represent microsecond-sized events. The machinery to handle both is basically the same inside their areas of applicability; you just have to disallow certain ambiguous conversions between them, as Pierre suggests. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From mwwiebe at gmail.com Thu Jun 2 16:06:58 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 15:06:58 -0500 Subject: [Numpy-discussion] tighten up ufunc casting rule Message-ID: Would anyone object to, at least temporarily, tightening up the default ufunc casting rule to 'same_kind' in NumPy master? It's a one line change, so would be easy to undo, but such a change is very desirable in my opinion. This would raise an exception, since it's np.add(a, 1.9, out=a), converting a float to an int: >>> a = np.arange(3, dtype=np.int32) >>> a += 1.9 It's also relevant to the discussion of nonlinear datetime units and datetime in general. Currently, it will silently convert a months or years delta to a days delta. I've made the conversions between non-linear metadata not satisfy the 'same_kind' rule, so they would be disallowed by default as well (but easily enabled with a casting='unsafe' parameter). By doing this change, we can see what other code it affects, such as scipy, and patch that code to be safe for this convention whether or not it is adopted in the next release. -Mark -------------- next part -------------- An HTML attachment was scrubbed... 
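For concreteness, a small sketch of the difference, assuming the casting keywords introduced with NumPy 1.6 (the current default behaves like 'unsafe'):

    import numpy as np

    a = np.arange(3, dtype=np.int32)

    # float64 -> int32 is not a "same kind" cast, so the proposed default would reject it
    print(np.can_cast(np.float64, np.int32, casting='same_kind'))  # False
    print(np.can_cast(np.float64, np.int32, casting='unsafe'))     # True

    np.add(a, 1.9, out=a, casting='unsafe')  # explicit opt-in: silently truncates
    print(a)                                 # [1 2 3] -- the fractional part is gone
    # with casting='same_kind' the same call would raise instead of truncating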
URL: From gael.varoquaux at normalesup.org Thu Jun 2 16:09:45 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 2 Jun 2011 22:09:45 +0200 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: Message-ID: <20110602200945.GG2554@phare.normalesup.org> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: > Would anyone object to, at least temporarily, tightening up the default > ufunc casting rule to 'same_kind' in NumPy master? It's a one line change, > so would be easy to undo, but such a change is very desirable in my > opinion. > This would raise an exception, since it's np.add(a, 1.9, out=a), > converting a float to an int: > >>> a = np.arange(3, dtype=np.int32) > >>> a += 1.9 That's probably going to break a huge amount of code which relies on the current behavior. Am I right in believing that this should only be considered for a major release of numpy, say numpy 2.0? Gael From mwwiebe at gmail.com Thu Jun 2 16:12:11 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 2 Jun 2011 15:12:11 -0500 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: <20110602200945.GG2554@phare.normalesup.org> References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: > > Would anyone object to, at least temporarily, tightening up the > default > > ufunc casting rule to 'same_kind' in NumPy master? It's a one line > change, > > so would be easy to undo, but such a change is very desirable in my > > opinion. > > This would raise an exception, since it's np.add(a, 1.9, out=a), > > converting a float to an int: > > > >>> a = np.arange(3, dtype=np.int32) > > > >>> a += 1.9 > > That's probably going to break a huge amount of code which relies on the > current behavior. > > Am I right in believing that this should only be considered for a major > release of numpy, say numpy 2.0? Absolutely, and that's why I'm proposing to do it in master now, fairly early in a development cycle, so we can evaluate its effects. If the next version is 1.7, we probably would roll it back for release (a 1 line change), and if the next version is 2.0, we probably would keep it in. I suspect at least some of the code relying on the current behavior may have bugs, and tightening this up is a way to reveal them. -Mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 2 17:40:07 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 2 Jun 2011 16:40:07 -0500 Subject: [Numpy-discussion] New functions. In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 22:06, Travis Oliphant wrote: > > On May 31, 2011, at 8:08 PM, Charles R Harris wrote: >> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible. > > +0 ?--- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling. ? This needs more discussion, but useful to mention in this context. Actually, these are completely orthogonal to Chuck's proposal. The NA-enabled dtypes (*not* NaN-enabled dtypes) would have (x + NA) == NA, just like R. fadd() would be useful for other things. 
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From srean.list at gmail.com Thu Jun 2 21:36:33 2011 From: srean.list at gmail.com (srean) Date: Thu, 2 Jun 2011 20:36:33 -0500 Subject: [Numpy-discussion] [Repost] Adding the arrays returned in an array iterator Message-ID: Bumping my question tentatively. I am fairly sure there is a good answer and for some reason it got overlooked. Regards srean ---------- Forwarded message ---------- From: srean Date: Fri, May 27, 2011 at 10:36 AM Subject: Adding the arrays in an array iterator To: Discussion of Numerical Python Hi List, I have to sum up an unknown number of ndarrays of the same size. These arrays, possibly thousands in number, are provided via an iterator. Right now I use python reduce with operator.add. Does that invoke the corresponding ufunc internally ? I want to avoid creating temporaries, which I suspect a naive invocation of reduce will create. With ufunc I know how to avoid making copies using the output parameter, but not sure how to make use of that in reduce. It is not essential that I use reduce though, so I would welcome idiomatic and efficient way of executing this. So far I have stayed away from building an ndarray object and summing across the relevant dimension. Is that what I should be doing ? Different invocations of this function has different number of arrays, so I cannot pre-compile away this into a numexpr. Thanks and regards srean -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Fri Jun 3 00:56:39 2011 From: Chris.Barker at noaa.gov (Chris Barker) Date: Thu, 02 Jun 2011 21:56:39 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7C110.3010903@noaa.gov> <81D865E7-9E9A-4D5F-84C4-5A554CEF2F84@gmail.com> <4DE7D9F7.70708@noaa.gov> Message-ID: <4DE86987.3080102@noaa.gov> On 6/2/11 12:57 PM, Robert Kern wrote: >>> Anyhow, years and months are simple enough. >> >> no, they are not -- they are fundamentally different than hours, days, etc. > > That doesn't make them *difficult*. I won't comment on how difficult it is -- I'm not writing the code. My core point is that they are different, and that should be clear in the API. > It's tricky to convert between > months and hours, yes, but that's not the only operation that we're > looking to represent. A *ton* of important time series are in these > calendrical units (monthly, semi-monthly, quarterly, yearly, etc.). If > you collect economic data on a monthly basis, you don't need to > unambiguously convert "January 2011" to a microsecond timestamp. You > do need to be able to "add 6 months", "add 1 year", and "roll up each > 3 month period into quarters", etc. Each of those operations *is* > unambiguous. I'm not all that sure that there isn't ambiguity in there, actually. But anyway, the ambiguity really shows up if people use the SAME dtypes, APIs, etc for these types of operations as they do for regular old time series. > The data simply doesn't represent microsecond-sized > events. true -- and this brings up another point -- would it be useful to have a way to capture the precision? Kind of like significant figures? I suppose that is implied by ones choice of units in a datetime: "weeks since a datetime" certainly implies something different than "microseconds since a datetime", but maybe it should be more formal. 
This is where the proposed approach is better than the CF standard -- in that case, most folks use floating point units, so the concept of precision is lost, even if your time units are hours, rather than micro seconds. > The machinery to handle both is basically the same inside their areas > of applicability; I'm confused about that -- I expect the machinery to be quite different for "Calendar" calculations than for "straight time" calculations. months, years, etc are different lengths depending on the when they occur. Or are you suggesting that if we keep the precision never higher than it should be, then it's easy? i.e. if you are storing your datetimes and time deltas as integer months, then you don't need to care that some months are 30 days, and some 31 (or 28, or 29). I'd have to think on that more -- my experience is not with financial data -- so that may make sense there, but in the sciences I'm familiar with (hydrology, oceanography, meteorology, ...), people are very quick to convert months to days or hours, and want to differentiate time series, etc, and then this stuff does get messy. And this applies to what is similar to economic data; monthly rainfall, monthly means wind speed, whatever. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From ralf.gommers at googlemail.com Fri Jun 3 05:15:04 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Fri, 3 Jun 2011 11:15:04 +0200 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 9:33 PM, Robert Kern wrote: > On Thu, Jun 2, 2011 at 14:07, Ralf Gommers > wrote: > > > > On Wed, Jun 1, 2011 at 5:45 PM, martin smith > wrote: > >> > >> I have a bit of code that performs multi-taper power spectra using numpy > >> and a C extension module. The C portion consists of an interface file > >> and a python-unaware computational file. The latter invokes fftpack. > >> > >> The straightforward setup.py appended below works fine on Linux. On > >> Windows using MinGW it's not so good. I first created a libpython27.a > >> file following instructions on the web. The C code now compiles but > >> fails to link since there is no fftpack_lite.so on Windows. I can't > >> find an fftpack_lite.dll (or even the source for fftpack_lite) and don't > >> know where to turn. > > > > Python dll's on Windows have a .pyd extension, there should be a > > fftpack_lite.pyd file. But you should be able to just leave off the > > extension, so it does the right thing on any platform. > > > > Also, the way you specified library_dirs doesn't work if numpy is not in > > site-packages, better to use something like: > > library_dirs = [os.path.join(numpy.__path__[0], 'fft')] > > That said, there is no good cross-platform way to link against other > Python extension modules. Please do not try. You will have to include > a copy of the FFTPACK code in your own extension module. > > Coming back to #608, that means there is no chance that the C version will land in scipy, correct? We're not going to ship two copies of FFTPACK. So the answer should be "rewrite in Python, if that's too slow use Cython". Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonas.ruebsam at web.de Fri Jun 3 09:30:18 2011 From: jonas.ruebsam at web.de (jgrub) Date: Fri, 3 Jun 2011 06:30:18 -0700 (PDT) Subject: [Numpy-discussion] numpy input with genfromttxt() Message-ID: <31757790.post@talk.nabble.com> Hello, im actually try to read in data with genfromtxt(), i want to read in numbers which are stored in a textfile like this: 0,000000 0,001221 0,001221 0,000000 1,278076 160,102539 4,000000E-7 0,000000 0,000000 0,002441 1,279297 160,000000 8,000000E-7 -0,001221 0,000000 0,001221 1,279297 159,897461 1,200000E-6 0,000000 0,000000 0,001221 1,279297 160,000000 1,600000E-6 -0,001221 0,000000 0,003662 1,278076 159,897461 2,000000E-6 0,000000 -0,001221 0,003662 1,279297 160,000000 my problem is that they are seperated with a comma so when i try to read them i just get a numpy array with NaN's. is there a short way to replace the "," with "." ? -- View this message in context: http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html Sent from the Numpy-discussion mailing list archive at Nabble.com. From sturla at molden.no Fri Jun 3 10:02:53 2011 From: sturla at molden.no (Sturla Molden) Date: Fri, 03 Jun 2011 16:02:53 +0200 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows In-Reply-To: References: Message-ID: <4DE8E98D.2000107@molden.no> Den 02.06.2011 21:07, skrev Ralf Gommers: > > After search for multi-taper I found > http://projects.scipy.org/scipy/ticket/608. Too bad no one has gotten > around to review your contribution before, it sounds like a good > addition for the scipy.signal module. I once implemented DPSS tapers in Python (by looking at the equations and code in NR), but not all of Thompson's adaptive multitaper method. I'm not sure if it's too similar to be used or shared. Sturla From shish at keba.be Fri Jun 3 10:10:05 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 3 Jun 2011 10:10:05 -0400 Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: <31757790.post@talk.nabble.com> References: <31757790.post@talk.nabble.com> Message-ID: Here's an ugly one-liner: numpy.genfromtxt('data.txt', converters=dict([k, lambda x: float(x.replace(',', '.'))] for k in range(len(open('data.txt').readline().strip().split())))) -=- Olivier 2011/6/3 jgrub > > Hello, im actually try to read in data with genfromtxt(), > i want to read in numbers which are stored in a textfile like this: > > 0,000000 0,001221 0,001221 0,000000 1,278076 > 160,102539 > > 4,000000E-7 0,000000 0,000000 0,002441 1,279297 > 160,000000 > > 8,000000E-7 -0,001221 0,000000 0,001221 1,279297 > 159,897461 > > 1,200000E-6 0,000000 0,000000 0,001221 1,279297 > 160,000000 > > 1,600000E-6 -0,001221 0,000000 0,003662 1,278076 > 159,897461 > > 2,000000E-6 0,000000 -0,001221 0,003662 1,279297 > 160,000000 > > my problem is that they are seperated with a comma so when i try to read > them > i just get a numpy array with NaN's. is there a short way to replace the > "," with "." ? > > -- > View this message in context: > http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html > Sent from the Numpy-discussion mailing list archive at Nabble.com. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gruben at bigpond.net.au Fri Jun 3 10:30:40 2011 From: gruben at bigpond.net.au (gary ruben) Date: Sat, 4 Jun 2011 00:30:40 +1000 Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: <31757790.post@talk.nabble.com> References: <31757790.post@talk.nabble.com> Message-ID: I think the easiest is to use an intermediate string. If your file is test.txt, In [22]: a = file('test.txt').read().replace(',','.') In [23]: import StringIO In [24]: b=genfromtxt(StringIO.StringIO(a)) In [25]: b Out[25]: array([[ 0.00000000e+00, 1.22100000e-03, 1.22100000e-03, 0.00000000e+00, 1.27807600e+00, 1.60102539e+02], [ 4.00000000e-07, 0.00000000e+00, 0.00000000e+00, 2.44100000e-03, 1.27929700e+00, 1.60000000e+02], [ 8.00000000e-07, -1.22100000e-03, 0.00000000e+00, 1.22100000e-03, 1.27929700e+00, 1.59897461e+02], [ 1.20000000e-06, 0.00000000e+00, 0.00000000e+00, 1.22100000e-03, 1.27929700e+00, 1.60000000e+02], [ 1.60000000e-06, -1.22100000e-03, 0.00000000e+00, 3.66200000e-03, 1.27807600e+00, 1.59897461e+02], [ 2.00000000e-06, 0.00000000e+00, -1.22100000e-03, 3.66200000e-03, 1.27929700e+00, 1.60000000e+02]]) On Fri, Jun 3, 2011 at 11:30 PM, jgrub wrote: > > Hello, im actually try to read in ?data with genfromtxt(), > i want to read in numbers which are stored in ?a textfile like this: > > 0,000000 ? ? ? ?0,001221 ? ? ? ?0,001221 ? ? ? ?0,000000 ? ? ? ?1,278076 ? ? ? ?160,102539 > > 4,000000E-7 ? ? 0,000000 ? ? ? ?0,000000 ? ? ? ?0,002441 ? ? ? ?1,279297 ? ? ? ?160,000000 > > 8,000000E-7 ? ? -0,001221 ? ? ? 0,000000 ? ? ? ?0,001221 ? ? ? ?1,279297 ? ? ? ?159,897461 > > 1,200000E-6 ? ? 0,000000 ? ? ? ?0,000000 ? ? ? ?0,001221 ? ? ? ?1,279297 ? ? ? ?160,000000 > > 1,600000E-6 ? ? -0,001221 ? ? ? 0,000000 ? ? ? ?0,003662 ? ? ? ?1,278076 ? ? ? ?159,897461 > > 2,000000E-6 ? ? 0,000000 ? ? ? ?-0,001221 ? ? ? 0,003662 ? ? ? ?1,279297 ? ? ? ?160,000000 > > my problem is that they are seperated with a comma so when i try to read > them > i just get a numpy array with NaN's. ?is there a short way to replace the > "," with "." ?? > > -- > View this message in context: http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html > Sent from the Numpy-discussion mailing list archive at Nabble.com. 
> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From jonas.ruebsam at web.de Fri Jun 3 10:32:04 2011 From: jonas.ruebsam at web.de (jonasr) Date: Fri, 3 Jun 2011 07:32:04 -0700 (PDT) Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: References: <31757790.post@talk.nabble.com> Message-ID: <31765860.post@talk.nabble.com> thank you very much, works quit nice and faster in comparison to the script i wrote and used before , im not that much used to lambda forms but it seems quit usefull in situations like this Olivier Delalleau-2 wrote: > > Here's an ugly one-liner: > > numpy.genfromtxt('data.txt', converters=dict([k, lambda x: > float(x.replace(',', '.'))] for k in > range(len(open('data.txt').readline().strip().split())))) > > -=- Olivier > > 2011/6/3 jgrub > >> >> Hello, im actually try to read in data with genfromtxt(), >> i want to read in numbers which are stored in a textfile like this: >> >> 0,000000 0,001221 0,001221 0,000000 1,278076 >> 160,102539 >> >> 4,000000E-7 0,000000 0,000000 0,002441 1,279297 >> 160,000000 >> >> 8,000000E-7 -0,001221 0,000000 0,001221 1,279297 >> 159,897461 >> >> 1,200000E-6 0,000000 0,000000 0,001221 1,279297 >> 160,000000 >> >> 1,600000E-6 -0,001221 0,000000 0,003662 1,278076 >> 159,897461 >> >> 2,000000E-6 0,000000 -0,001221 0,003662 1,279297 >> 160,000000 >> >> my problem is that they are seperated with a comma so when i try to read >> them >> i just get a numpy array with NaN's. is there a short way to replace the >> "," with "." ? >> >> -- >> View this message in context: >> http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html >> Sent from the Numpy-discussion mailing list archive at Nabble.com. >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- View this message in context: http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31765860.html Sent from the Numpy-discussion mailing list archive at Nabble.com. 
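For anyone who, like the poster, would rather avoid lambda forms, the same converters trick can be written with a named function; a sketch assuming the whitespace-delimited layout and the 'data.txt' file name used earlier in the thread:

    import numpy as np

    def comma_float(field):
        # genfromtxt passes each field to the converter as a string
        return float(field.replace(',', '.'))

    ncols = len(open('data.txt').readline().split())
    converters = dict((i, comma_float) for i in range(ncols))
    data = np.genfromtxt('data.txt', converters=converters)

The dict simply maps every column index to the same converter.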
From jonas.ruebsam at web.de Fri Jun 3 10:33:11 2011 From: jonas.ruebsam at web.de (jonasr) Date: Fri, 3 Jun 2011 07:33:11 -0700 (PDT) Subject: [Numpy-discussion] numpy input with genfromttxt() Message-ID: <31765860.post@talk.nabble.com> thank you very much, works much nicer and faster in comparison to the script i wrote and used before , im not that much used to lambda forms but it seems quit usefull in situations like this Olivier Delalleau-2 wrote: > > Here's an ugly one-liner: > > numpy.genfromtxt('data.txt', converters=dict([k, lambda x: > float(x.replace(',', '.'))] for k in > range(len(open('data.txt').readline().strip().split())))) > > -=- Olivier > > 2011/6/3 jgrub > >> >> Hello, im actually try to read in data with genfromtxt(), >> i want to read in numbers which are stored in a textfile like this: >> >> 0,000000 0,001221 0,001221 0,000000 1,278076 >> 160,102539 >> >> 4,000000E-7 0,000000 0,000000 0,002441 1,279297 >> 160,000000 >> >> 8,000000E-7 -0,001221 0,000000 0,001221 1,279297 >> 159,897461 >> >> 1,200000E-6 0,000000 0,000000 0,001221 1,279297 >> 160,000000 >> >> 1,600000E-6 -0,001221 0,000000 0,003662 1,278076 >> 159,897461 >> >> 2,000000E-6 0,000000 -0,001221 0,003662 1,279297 >> 160,000000 >> >> my problem is that they are seperated with a comma so when i try to read >> them >> i just get a numpy array with NaN's. is there a short way to replace the >> "," with "." ? >> >> -- >> View this message in context: >> http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html >> Sent from the Numpy-discussion mailing list archive at Nabble.com. >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- View this message in context: http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31765860.html Sent from the Numpy-discussion mailing list archive at Nabble.com. From charlesr.harris at gmail.com Fri Jun 3 10:38:42 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 3 Jun 2011 08:38:42 -0600 Subject: [Numpy-discussion] [Repost] Adding the arrays returned in an array iterator In-Reply-To: References: Message-ID: On Thu, Jun 2, 2011 at 7:36 PM, srean wrote: > Bumping my question tentatively. I am fairly sure there is a good answer > and for some reason it got overlooked. > > Regards > srean > > ---------- Forwarded message ---------- > From: srean > Date: Fri, May 27, 2011 at 10:36 AM > Subject: Adding the arrays in an array iterator > To: Discussion of Numerical Python > > > Hi List, > > I have to sum up an unknown number of ndarrays of the same size. These > arrays, possibly thousands in number, are provided via an iterator. Right > now I use python reduce with operator.add. > > Does that invoke the corresponding ufunc internally ? I want to avoid > creating temporaries, which I suspect a naive invocation of reduce will > create. With ufunc I know how to avoid making copies using the output > parameter, but not sure how to make use of that in reduce. > > It is not essential that I use reduce though, so I would welcome idiomatic > and efficient way of executing this. So far I have stayed away from building > an ndarray object and summing across the relevant dimension. Is that what I > should be doing ? 
Different invocations of this function has different > number of arrays, so I cannot pre-compile away this into a numexpr. > > There are several ways to do this depending on how big the arrays are, how many there are, and how much memory you have. If you have all the arrays as slices in a cube with the arrays numbered by the first index, then you can do >>> cube.sum(0) If they are in a list, then I would do something like result = arrays[0].copy() for a in arrays[1:]: result += a But much depends on the details of your problem. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsouthey at gmail.com Fri Jun 3 11:01:09 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 03 Jun 2011 10:01:09 -0500 Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: <31765860.post@talk.nabble.com> References: <31765860.post@talk.nabble.com> Message-ID: <4DE8F735.2020908@gmail.com> On 06/03/2011 09:33 AM, jonasr wrote: > thank you very much, works much nicer and faster in comparison to the script > i wrote and used before , > im not that much used to lambda forms but it seems quit usefull in > situations like this > > > Olivier Delalleau-2 wrote: >> Here's an ugly one-liner: >> >> numpy.genfromtxt('data.txt', converters=dict([k, lambda x: >> float(x.replace(',', '.'))] for k in >> range(len(open('data.txt').readline().strip().split())))) >> >> -=- Olivier >> >> 2011/6/3 jgrub >> >>> Hello, im actually try to read in data with genfromtxt(), >>> i want to read in numbers which are stored in a textfile like this: >>> >>> 0,000000 0,001221 0,001221 0,000000 1,278076 >>> 160,102539 >>> >>> 4,000000E-7 0,000000 0,000000 0,002441 1,279297 >>> 160,000000 >>> >>> 8,000000E-7 -0,001221 0,000000 0,001221 1,279297 >>> 159,897461 >>> >>> 1,200000E-6 0,000000 0,000000 0,001221 1,279297 >>> 160,000000 >>> >>> 1,600000E-6 -0,001221 0,000000 0,003662 1,278076 >>> 159,897461 >>> >>> 2,000000E-6 0,000000 -0,001221 0,003662 1,279297 >>> 160,000000 >>> >>> my problem is that they are seperated with a comma so when i try to read >>> them >>> i just get a numpy array with NaN's. is there a short way to replace the >>> "," with "." ? >>> >>> -- >>> View this message in context: >>> http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html >>> Sent from the Numpy-discussion mailing list archive at Nabble.com. >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> Isn't this just because of the 'locale' settings? A quick search showed ticket 884 that has code changing the locale that may be useful for you: http://projects.scipy.org/numpy/ticket/884 Perhaps a similar bug exists with genfromtxt? If it is nicely behaved, just use Python's csv module. From the csv documentation: "Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding() ). To decode a file using a different encoding, use the encoding argument of open:" Bruce -------------- next part -------------- An HTML attachment was scrubbed... 
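One way to act on the locale suggestion above, assuming a German locale is actually installed on the system (locale names vary; 'de_DE.UTF-8' is common on Linux, and setlocale raises locale.Error when the requested name is not available):

    import locale
    import numpy as np

    locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')  # may raise locale.Error
    # once LC_NUMERIC is set, locale.atof understands '0,001221' and '4,000000E-7'
    data = np.genfromtxt('data.txt', converters=dict((i, locale.atof) for i in range(6)))

The hard-coded 6 matches the number of columns in the sample file; the converter approach shown earlier works without touching the process-wide locale at all.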
URL: From jonas.ruebsam at web.de Fri Jun 3 11:33:29 2011 From: jonas.ruebsam at web.de (jonasr) Date: Fri, 3 Jun 2011 08:33:29 -0700 (PDT) Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: <4DE8F735.2020908@gmail.com> References: <31757790.post@talk.nabble.com> <31765860.post@talk.nabble.com> <4DE8F735.2020908@gmail.com> Message-ID: <31766315.post@talk.nabble.com> Bruce Southey wrote: > > On 06/03/2011 09:33 AM, jonasr wrote: >> thank you very much, works much nicer and faster in comparison to the >> script >> i wrote and used before , >> im not that much used to lambda forms but it seems quit usefull in >> situations like this >> >> >> Olivier Delalleau-2 wrote: >>> Here's an ugly one-liner: >>> >>> numpy.genfromtxt('data.txt', converters=dict([k, lambda x: >>> float(x.replace(',', '.'))] for k in >>> range(len(open('data.txt').readline().strip().split())))) >>> >>> -=- Olivier >>> >>> 2011/6/3 jgrub >>> >>>> Hello, im actually try to read in data with genfromtxt(), >>>> i want to read in numbers which are stored in a textfile like this: >>>> >>>> 0,000000 0,001221 0,001221 0,000000 >>>> 1,278076 >>>> 160,102539 >>>> >>>> 4,000000E-7 0,000000 0,000000 0,002441 >>>> 1,279297 >>>> 160,000000 >>>> >>>> 8,000000E-7 -0,001221 0,000000 0,001221 >>>> 1,279297 >>>> 159,897461 >>>> >>>> 1,200000E-6 0,000000 0,000000 0,001221 >>>> 1,279297 >>>> 160,000000 >>>> >>>> 1,600000E-6 -0,001221 0,000000 0,003662 >>>> 1,278076 >>>> 159,897461 >>>> >>>> 2,000000E-6 0,000000 -0,001221 0,003662 >>>> 1,279297 >>>> 160,000000 >>>> >>>> my problem is that they are seperated with a comma so when i try to >>>> read >>>> them >>>> i just get a numpy array with NaN's. is there a short way to replace >>>> the >>>> "," with "." ? >>>> >>>> -- >>>> View this message in context: >>>> http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html >>>> Sent from the Numpy-discussion mailing list archive at Nabble.com. >>>> >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion at scipy.org >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >>> > Isn't this just because of the 'locale' settings? > A quick search showed ticket 884 that has code changing the locale that > may be useful for you: > http://projects.scipy.org/numpy/ticket/884 > > Perhaps a similar bug exists with genfromtxt? > > If it is nicely behaved, just use Python's csv module. > From the csv documentation: > "Since open() > is > used to open a CSV file for reading, the file will by default be decoded > into unicode using the system default encoding (see > locale.getpreferredencoding() > ). > To decode a file using a different encoding, use the encoding argument > of open:" > > > Bruce > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > i checked my local settings and got In [37]: locale.getlocale() Out[37]: ('de_DE', 'UTF8') when i try locale.setlocale(locale.LC_NUMERIC, 'de_DE') i get: Error: unsupported locale setting same with 'fi_FI' any idea ? -- View this message in context: http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31766315.html Sent from the Numpy-discussion mailing list archive at Nabble.com. 
From ralf.gommers at googlemail.com Fri Jun 3 11:36:20 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Fri, 3 Jun 2011 17:36:20 +0200 Subject: [Numpy-discussion] numpy easy_install fails for python 3.2 In-Reply-To: References: Message-ID: On Wed, May 4, 2011 at 8:51 PM, Matthew Brett wrote: > Hi, > > On Wed, May 4, 2011 at 1:23 PM, Ralf Gommers > wrote: > > > > > > On Wed, May 4, 2011 at 6:53 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> I can imagine that this is low-priority, but I have just been enjoying > >> pytox for automated virtualenv testing: > >> > >> http://codespeak.net/tox/index.html > >> > >> which revealed that numpy download-build-install via easy_install > >> (distribute) fails with the appended traceback ending in "ValueError: > >> 'build/py3k/numpy' is not a directory". > > > > I think it would be good to just say "wontfix" immediately, rather than > just > > leaving a ticket open and not do anything (like we did with > > http://projects.scipy.org/numpy/ticket/860). > > Ouch - yes - I see what you mean. > > > It seems tox can also use pip (which works with py3k now), does that work > > for you? > > I think current tox 0.9 uses virtualenv5 for python3.2 and has to use > distribute, I believe. Current tip of pytox appears to use virtualenv > 1.6.1 for python 3.2, and does use pip, but generates the same error > in the end. > > I've appended the result of a fresh python3.2 virtualenv and a "pip > install numpy". > > Sorry - I know these are not fun problems, > Not too much fun indeed. Helped me become better friends with pdb though. I opened http://projects.scipy.org/numpy/ticket/1857 and explained the issue as far as I understand it. Summary: 1. There's a bug in Configuration.__init__ where dirname(abspath(nonexistingfile)) gives curdir. 1. The bug in (1) happens to work out well in combination with the os.chdir call in the main setup.py. 1. Once we correct for this so that the bug from (1) does not affect install under pip + virtualenv, we run into another issue due to both numpy.distutils and pip messing with __file__ in ugly and incompatible ways. Even if the "python setup.py egg_info" command in pip can be fixed, there may well be more issues to run into. I don't think I'm going to spend more time on this. If someone wants a challenge and can wrap his brain around things like: caller_file = eval('__file__', frame.f_globals, frame.f_locals) have fun! Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsouthey at gmail.com Fri Jun 3 11:54:42 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 03 Jun 2011 10:54:42 -0500 Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: <31766315.post@talk.nabble.com> References: <31757790.post@talk.nabble.com> <31765860.post@talk.nabble.com> <4DE8F735.2020908@gmail.com> <31766315.post@talk.nabble.com> Message-ID: <4DE903C2.5090106@gmail.com> On 06/03/2011 10:33 AM, jonasr wrote: > > > Bruce Southey wrote: >> On 06/03/2011 09:33 AM, jonasr wrote: >>> thank you very much, works much nicer and faster in comparison to the >>> script >>> i wrote and used before , >>> im not that much used to lambda forms but it seems quit usefull in >>> situations like this >>> >>> >>> Olivier Delalleau-2 wrote: >>>> Here's an ugly one-liner: >>>> >>>> numpy.genfromtxt('data.txt', converters=dict([k, lambda x: >>>> float(x.replace(',', '.'))] for k in >>>> range(len(open('data.txt').readline().strip().split())))) >>>> >>>> -=- Olivier >>>> >>>> 2011/6/3 jgrub >>>> >>>>> Hello, im actually try to read in data with genfromtxt(), >>>>> i want to read in numbers which are stored in a textfile like this: >>>>> >>>>> 0,000000 0,001221 0,001221 0,000000 >>>>> 1,278076 >>>>> 160,102539 >>>>> >>>>> 4,000000E-7 0,000000 0,000000 0,002441 >>>>> 1,279297 >>>>> 160,000000 >>>>> >>>>> 8,000000E-7 -0,001221 0,000000 0,001221 >>>>> 1,279297 >>>>> 159,897461 >>>>> >>>>> 1,200000E-6 0,000000 0,000000 0,001221 >>>>> 1,279297 >>>>> 160,000000 >>>>> >>>>> 1,600000E-6 -0,001221 0,000000 0,003662 >>>>> 1,278076 >>>>> 159,897461 >>>>> >>>>> 2,000000E-6 0,000000 -0,001221 0,003662 >>>>> 1,279297 >>>>> 160,000000 >>>>> >>>>> my problem is that they are seperated with a comma so when i try to >>>>> read >>>>> them >>>>> i just get a numpy array with NaN's. is there a short way to replace >>>>> the >>>>> "," with "." ? >>>>> >>>>> -- >>>>> View this message in context: >>>>> http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html >>>>> Sent from the Numpy-discussion mailing list archive at Nabble.com. >>>>> >>>>> _______________________________________________ >>>>> NumPy-Discussion mailing list >>>>> NumPy-Discussion at scipy.org >>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>>> >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion at scipy.org >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>> >>>> >> Isn't this just because of the 'locale' settings? >> A quick search showed ticket 884 that has code changing the locale that >> may be useful for you: >> http://projects.scipy.org/numpy/ticket/884 >> >> Perhaps a similar bug exists with genfromtxt? >> >> If it is nicely behaved, just use Python's csv module. >> From the csv documentation: >> "Since open() >> is >> used to open a CSV file for reading, the file will by default be decoded >> into unicode using the system default encoding (see >> locale.getpreferredencoding() >> ). >> To decode a file using a different encoding, use the encoding argument >> of open:" >> >> >> Bruce >> >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > i checked my local settings and got > > In [37]: locale.getlocale() > Out[37]: ('de_DE', 'UTF8') > > when i try > > locale.setlocale(locale.LC_NUMERIC, 'de_DE') > > i get: Error: unsupported locale setting > same with 'fi_FI' > any idea ? 
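On the "unsupported locale setting" error: setlocale only accepts names that are actually installed on the system, and on most Linux machines that means spelling out the encoding as well (locale -a lists what is available). A sketch assuming a generated de_DE.UTF-8 locale:

import locale
locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')   # exact spelling depends on the system
locale.atof('160,102539')                            # -> 160.102539

Even with the locale set, genfromtxt's default float conversion appears to go through Python's locale-independent float(), so locale.atof would still have to be passed in via converters as in the snippet earlier in the thread.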
> > > > You don't seem to be using Python. The 'In [37]' suggest ipython so you have to find how ipython does this. However, I am not sure that changing locale will help here but I am rather naive on this. If it doesn't then file a ticket with a clear example and ideally with a patch (one can hope). Bruce From shish at keba.be Fri Jun 3 13:36:23 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 3 Jun 2011 13:36:23 -0400 Subject: [Numpy-discussion] numpy input with genfromttxt() In-Reply-To: <4DE903C2.5090106@gmail.com> References: <31757790.post@talk.nabble.com> <31765860.post@talk.nabble.com> <4DE8F735.2020908@gmail.com> <31766315.post@talk.nabble.com> <4DE903C2.5090106@gmail.com> Message-ID: 2011/6/3 Bruce Southey > On 06/03/2011 10:33 AM, jonasr wrote: > > > > > > Bruce Southey wrote: > >> On 06/03/2011 09:33 AM, jonasr wrote: > >>> thank you very much, works much nicer and faster in comparison to the > >>> script > >>> i wrote and used before , > >>> im not that much used to lambda forms but it seems quit usefull in > >>> situations like this > >>> > >>> > >>> Olivier Delalleau-2 wrote: > >>>> Here's an ugly one-liner: > >>>> > >>>> numpy.genfromtxt('data.txt', converters=dict([k, lambda x: > >>>> float(x.replace(',', '.'))] for k in > >>>> range(len(open('data.txt').readline().strip().split())))) > >>>> > >>>> -=- Olivier > >>>> > >>>> 2011/6/3 jgrub > >>>> > >>>>> Hello, im actually try to read in data with genfromtxt(), > >>>>> i want to read in numbers which are stored in a textfile like this: > >>>>> > >>>>> 0,000000 0,001221 0,001221 0,000000 > >>>>> 1,278076 > >>>>> 160,102539 > >>>>> > >>>>> 4,000000E-7 0,000000 0,000000 0,002441 > >>>>> 1,279297 > >>>>> 160,000000 > >>>>> > >>>>> 8,000000E-7 -0,001221 0,000000 0,001221 > >>>>> 1,279297 > >>>>> 159,897461 > >>>>> > >>>>> 1,200000E-6 0,000000 0,000000 0,001221 > >>>>> 1,279297 > >>>>> 160,000000 > >>>>> > >>>>> 1,600000E-6 -0,001221 0,000000 0,003662 > >>>>> 1,278076 > >>>>> 159,897461 > >>>>> > >>>>> 2,000000E-6 0,000000 -0,001221 0,003662 > >>>>> 1,279297 > >>>>> 160,000000 > >>>>> > >>>>> my problem is that they are seperated with a comma so when i try to > >>>>> read > >>>>> them > >>>>> i just get a numpy array with NaN's. is there a short way to replace > >>>>> the > >>>>> "," with "." ? > >>>>> > >>>>> -- > >>>>> View this message in context: > >>>>> > http://old.nabble.com/numpy-input-with-genfromttxt%28%29-tp31757790p31757790.html > >>>>> Sent from the Numpy-discussion mailing list archive at Nabble.com. > >>>>> > >>>>> _______________________________________________ > >>>>> NumPy-Discussion mailing list > >>>>> NumPy-Discussion at scipy.org > >>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >>>>> > >>>> _______________________________________________ > >>>> NumPy-Discussion mailing list > >>>> NumPy-Discussion at scipy.org > >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >>>> > >>>> > >> Isn't this just because of the 'locale' settings? > >> A quick search showed ticket 884 that has code changing the locale that > >> may be useful for you: > >> http://projects.scipy.org/numpy/ticket/884 > >> > >> Perhaps a similar bug exists with genfromtxt? > >> > >> If it is nicely behaved, just use Python's csv module. 
> >> From the csv documentation: > >> "Since open() > >> is > >> used to open a CSV file for reading, the file will by default be decoded > >> into unicode using the system default encoding (see > >> locale.getpreferredencoding() > >> < > http://docs.python.org/release/3.1.3/library/locale.html#locale.getpreferredencoding > >). > >> To decode a file using a different encoding, use the encoding argument > >> of open:" > >> > >> > >> Bruce > >> > >> > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> > > i checked my local settings and got > > > > In [37]: locale.getlocale() > > Out[37]: ('de_DE', 'UTF8') > > > > when i try > > > > locale.setlocale(locale.LC_NUMERIC, 'de_DE') > > > > i get: Error: unsupported locale setting > > same with 'fi_FI' > > any idea ? > > > > > > > > > You don't seem to be using Python. > The 'In [37]' suggest ipython so you have to find how ipython does this. > > However, I am not sure that changing locale will help here but I am > rather naive on this. If it doesn't then file a ticket with a clear > example and ideally with a patch (one can hope). > > Bruce > Should work just the same in iPython. Works fine with me in iPython with Python 2.5.1. -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Fri Jun 3 14:46:50 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Fri, 03 Jun 2011 14:46:50 -0400 Subject: [Numpy-discussion] reinterpret complex <-> float Message-ID: Can I arrange to reinterpret an array of complex of length N as an array of float of length 2N, and vice-versa? If so, how? From robert.kern at gmail.com Fri Jun 3 14:55:36 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 3 Jun 2011 13:55:36 -0500 Subject: [Numpy-discussion] reinterpret complex <-> float In-Reply-To: References: Message-ID: On Fri, Jun 3, 2011 at 13:46, Neal Becker wrote: > Can I arrange to reinterpret an array of complex of length N as an array of > float of length 2N, and vice-versa? ?If so, how? [~] |1> carr = np.arange(10, dtype=complex) [~] |2> carr array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j, 7.+0.j, 8.+0.j, 9.+0.j]) [~] |3> carr.view(float) array([ 0., 0., 1., 0., 2., 0., 3., 0., 4., 0., 5., 0., 6., 0., 7., 0., 8., 0., 9., 0.]) [~] |4> _.view(complex) array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j, 7.+0.j, 8.+0.j, 9.+0.j]) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From warren.weckesser at enthought.com Fri Jun 3 14:56:46 2011 From: warren.weckesser at enthought.com (Warren Weckesser) Date: Fri, 3 Jun 2011 13:56:46 -0500 Subject: [Numpy-discussion] reinterpret complex <-> float In-Reply-To: References: Message-ID: On Fri, Jun 3, 2011 at 1:46 PM, Neal Becker wrote: > Can I arrange to reinterpret an array of complex of length N as an array of > float of length 2N, and vice-versa? If so, how? 
> You can use the view() method (if the data is contiguous): In [14]: z = array([1.0+1j, 2.0+3j, 4, 9j]) In [15]: x = z.view(float64) In [16]: x Out[16]: array([ 1., 1., 2., 3., 4., 0., 0., 9.]) But this won't work if the data in the array is not contiguous: In [17]: z2 = z[::2] In [18]: z2 Out[18]: array([ 1.+1.j, 4.+0.j]) In [19]: z2.view(float64) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /Users/warren/ in () ValueError: new type not compatible with array. (I'm using ipython with the -pylab option; 'array' and 'float64' are from numpy.) Warren > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From christoph.graves at gmail.com Fri Jun 3 18:14:56 2011 From: christoph.graves at gmail.com (cgraves) Date: Fri, 3 Jun 2011 15:14:56 -0700 (PDT) Subject: [Numpy-discussion] loadtxt ndmin option In-Reply-To: References: <2F4F21FC-6F70-45B3-8417-B256F3A167AB@gmail.com> <5D10525C-839D-4AFC-8F48-72A2C01D7217@astro.physik.uni-goettingen.de> <40DA9A71-A1B7-4F0E-A013-925C2414155C@astro.physik.uni-goettingen.de> <38117E8E-ADDB-4B1F-A797-04A6D4B6DEF4@gmail.com> <0D212BB0-437C-4312-9CE1-BF982F3BAC40@astro.physik.uni-goettingen.de> <31740152.post@talk.nabble.com> Message-ID: <31769169.post@talk.nabble.com> Derek Homeier wrote: > > Hi Chris, > > On 31 May 2011, at 13:56, cgraves wrote: > >> I've downloaded the latest numpy (1.6.0) and loadtxt has the ndmin >> option, >> however neither genfromtxt nor recfromtxt, which use loadtxt, have it. >> Should they have inherited the option? Who can make it happen? > > you are mistaken, genfromtxt is not using loadtxt (and could not > possibly, since it has the more complex parser to handle missing > data); thus ndmin could not be inherited automatically. > It certainly would make sense to provide the same functionality for > genfromtxt (which should then be inherited by [nd,ma,rec]fromtxt), so > I'd go ahead and file a feature (enhancement) request. I can't promise > I can take care of it myself, as I am less familiar with genfromtxt, > but I'd certainly have a look at it. > > Does anyone have an opinion whether this is a case for reopening (yet > again...) > http://projects.scipy.org/numpy/ticket/1562 > or create a new ticket? > Thanks Derek. That would be greatly appreciated! Based on the follow-up messages in this thread, it looks like (hopefully) there will not be too much additional work in implementing it. For now I'll just use the temporary fix, a .reshape(-1), on any recfromtxt's that might read in a single row of data.. Kind regards, Chris -- View this message in context: http://old.nabble.com/loadtxt-savetxt-tickets-tp31238871p31769169.html Sent from the Numpy-discussion mailing list archive at Nabble.com. From srean.list at gmail.com Fri Jun 3 21:25:57 2011 From: srean.list at gmail.com (srean) Date: Fri, 3 Jun 2011 20:25:57 -0500 Subject: [Numpy-discussion] [Repost] Adding the arrays returned in an array iterator In-Reply-To: References: Message-ID: > If they are in a list, then I would do something like Apologies if it wasnt clear in my previous mail. The arrays are in a lazy iterator, they are non-contiguous and there are several thousands of them. I was hoping there was a way to get at a "+=" operator for arrays to use in a reduce. Seems like indeed there is. 
I had missed operator.iadd() > result = arrays[0].copy() > for a in arrays[1:]: > result += a > > But much depends on the details of your problem. > Chuck From akshar.bhosale at gmail.com Fri Jun 3 17:05:21 2011 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Sat, 4 Jun 2011 02:35:21 +0530 Subject: [Numpy-discussion] error when numpy is imported Message-ID: [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import numpy Traceback (most recent call last): File "", line 1, in File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/__init__.py", line 137, in import add_newdocs File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/add_newdocs.py", line 9, in from numpy.lib import add_newdoc File "/homeadmin/numpy-1.6.0b2/install/lib/python/numpy/lib/__init__.py", line 13, in from polynomial import * File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/lib/polynomial.py", line 17, in from numpy.linalg import eigvals, lstsq File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/linalg/__init__.py", line 48, in from linalg import * File "/home/dmin/numpy-1.6.0b2/install/lib/python/numpy/linalg/linalg.py", line 23, in from numpy.linalg import lapack_lite ImportError: /home/external/Python-2.6/lib/python2.6/site-packages/numpy/linalg/lapack_lite.so: undefined symbol: _gfortran_concat_string We have intel mkl with which we are trying to build numpy and scipy. Kinly help -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Sat Jun 4 12:09:25 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sat, 4 Jun 2011 18:09:25 +0200 Subject: [Numpy-discussion] error when numpy is imported In-Reply-To: References: Message-ID: On Fri, Jun 3, 2011 at 11:05 PM, akshar bhosale wrote: > [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import numpy > Traceback (most recent call last): > File "", line 1, in > File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/__init__.py", > line 137, in > import add_newdocs > File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/add_newdocs.py", > line 9, in > from numpy.lib import add_newdoc > File "/homeadmin/numpy-1.6.0b2/install/lib/python/numpy/lib/__init__.py", > line 13, in > from polynomial import * > File > "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/lib/polynomial.py", line > 17, in > from numpy.linalg import eigvals, lstsq > File > "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/linalg/__init__.py", > line 48, in > from linalg import * > File > "/home/dmin/numpy-1.6.0b2/install/lib/python/numpy/linalg/linalg.py", line > 23, in > from numpy.linalg import lapack_lite > ImportError: > /home/external/Python-2.6/lib/python2.6/site-packages/numpy/linalg/lapack_lite.so: > undefined symbol: _gfortran_concat_string > > We have intel mkl with which we are trying to build numpy and scipy. > > It looks like you are mixing different fortran compilers, which doesn't work. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jun 4 14:34:37 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 4 Jun 2011 12:34:37 -0600 Subject: [Numpy-discussion] _sort module. 
Message-ID: Hi All, Numpy has a _sort module which contains *no* methods and whose only purpose is to modify the type descriptors by adding pointers to the sorting functions when it is loaded. Consequently _sort is imported early in numpy/core/__init__.py and is otherwise unused. This really doesn't seem right. I would like to make the sorting functions into a library or at least link them directly as part of the build process. I'd also like to move them into their own directory and break them up into several files in the process of adding more functions for order statistics. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Sun Jun 5 16:43:45 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 5 Jun 2011 22:43:45 +0200 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: > On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >> > Would anyone object to, at least temporarily, tightening up the >> default >> > ufunc casting rule to 'same_kind' in NumPy master? It's a one line >> change, >> > so would be easy to undo, but such a change is very desirable in my >> > opinion. >> > This would raise an exception, since it's np.add(a, 1.9, out=a), >> > converting a float to an int: >> >> > >>> a = np.arange(3, dtype=np.int32) >> >> > >>> a += 1.9 >> >> That's probably going to break a huge amount of code which relies on the >> current behavior. >> >> Am I right in believing that this should only be considered for a major >> release of numpy, say numpy 2.0? > > > Absolutely, and that's why I'm proposing to do it in master now, fairly > early in a development cycle, so we can evaluate its effects. If the next > version is 1.7, we probably would roll it back for release (a 1 line > change), and if the next version is 2.0, we probably would keep it in. > > I suspect at least some of the code relying on the current behavior may > have bugs, and tightening this up is a way to reveal them. > > Here are some results of testing your tighten_casting branch on a few projects - no need to first put it in master first to do that. Four failures in numpy, two in scipy, four in scikit-learn (plus two that don't look related), none in scikits.statsmodels. I didn't check how many of them are actual bugs. I'm not against trying out your change, but it would probably be good to do some more testing first and fix the issues found before putting it in. Then at least if people run into issues with the already tested packages, you can just tell them to update those to latest master. 
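For reference, the rule being exercised by these test runs can be poked at directly with np.can_cast and the casting keyword from the new casting machinery: under 'same_kind', float to int is no longer accepted implicitly, while int to float and float64 to float32 still are:

>>> np.can_cast(np.float64, np.int32, casting='same_kind')
False
>>> np.can_cast(np.int32, np.float64, casting='same_kind')
True
>>> np.can_cast(np.float64, np.float32, casting='same_kind')
True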
Cheers, Ralf NUMPY ====================================================================== ERROR: test_ones_like (test_numeric.TestLikeFuncs) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rgommers/Code/numpy/numpy/core/tests/test_numeric.py", line 1257, in test_ones_like self.check_like_function(np.ones_like, 1) File "/Users/rgommers/Code/numpy/numpy/core/tests/test_numeric.py", line 1200, in check_like_function dz = like_function(d, dtype=dtype) TypeError: found a loop for ufunc 'ones_like' matching the type-tuple, but the inputs and/or outputs could not be cast according to the casting rule ====================================================================== ERROR: test_methods_with_output (test_core.TestMaskedArrayArithmetic) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rgommers/Code/numpy/numpy/ma/tests/test_core.py", line 1109, in test_methods_with_output result = xmmeth(axis=0, out=output) File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 4774, in std out **= 0.5 File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 3765, in __ipow__ ndarray.__ipow__(self._data, np.where(self._mask, 1, other_data)) TypeError: ufunc 'power' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' ====================================================================== ERROR: Check that we don't shrink a mask when not wanted ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rgommers/Code/numpy/numpy/ma/tests/test_core.py", line 1044, in test_noshrinking a /= 1. File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 3725, in __idiv__ ndarray.__idiv__(self._data, np.where(self._mask, 1, other_data)) TypeError: ufunc 'divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' ====================================================================== ERROR: Test of inplace additions ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rgommers/Code/numpy/numpy/ma/tests/test_core.py", line 1651, in test_inplace_addition_array x += a File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 3686, in __iadd__ ndarray.__iadd__(self._data, np.where(self._mask, 0, getdata(other))) TypeError: ufunc 'add' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' SCIPY ====================================================================== ERROR: gaussian gradient magnitude filter 1 ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", line 187, in runTest self.test(*self.arg) File "/Users/rgommers/Code/scipy/scipy/ndimage/tests/test_ndimage.py", line 670, in test_gaussian_gradient_magnitude01 1.0) File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 479, in gaussian_gradient_magnitude cval, extra_arguments = (sigma,)) File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 450, in generic_gradient_magnitude numpy.sqrt(output, output) TypeError: ufunc 'sqrt' output (typecode 'd') could not be coerced to provided output 
parameter (typecode 'l') according to the casting rule 'same_kind' ====================================================================== ERROR: gaussian gradient magnitude filter 2 ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", line 187, in runTest self.test(*self.arg) File "/Users/rgommers/Code/scipy/scipy/ndimage/tests/test_ndimage.py", line 685, in test_gaussian_gradient_magnitude02 output) File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 479, in gaussian_gradient_magnitude cval, extra_arguments = (sigma,)) File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 450, in generic_gradient_magnitude numpy.sqrt(output, output) TypeError: ufunc 'sqrt' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' SCIKIT-LEARN ====================================================================== ERROR: Test decision_function ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", line 187, in runTest self.test(*self.arg) File "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/tests/test_svm.py", line 254, in test_decision_function assert_array_almost_equal(-dec, np.ravel(clf.decision_function(data))) File "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/base.py", line 276, in decision_function **self._get_params()) File "libsvm.pyx", line 385, in scikits.learn.svm.libsvm.decision_function (scikits/learn/svm/libsvm.c:4748) ValueError: ndarray is not C-contiguous ====================================================================== ERROR: Test weights on individual samples ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", line 187, in runTest self.test(*self.arg) File "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/tests/test_svm.py", line 282, in test_sample_weights assert_array_equal(clf.predict(X[2]), [1.]) File "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/base.py", line 186, in predict svm_type=svm_type, **self._get_params()) File "libsvm.pyx", line 224, in scikits.learn.svm.libsvm.predict (scikits/learn/svm/libsvm.c:3133) ValueError: ndarray is not C-contiguous ====================================================================== FAIL: Doctest: scikits.learn.decomposition.pca.PCA ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 2145, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for scikits.learn.decomposition.pca.PCA File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 94, in PCA ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 161, in scikits.learn.decomposition.pca.PCA Failed example: pca.fit(X) Exception raised: Traceback (most recent call last): File 
"/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in pca.fit(X) File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 191, in fit self._fit(X, **params) File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 227, in _fit X -= self.mean_ TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 163, in scikits.learn.decomposition.pca.PCA Failed example: print pca.explained_variance_ratio_ Exception raised: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in print pca.explained_variance_ratio_ AttributeError: 'PCA' object has no attribute 'explained_variance_ratio_' >> raise self.failureException(self.format_failure(.getvalue())) ====================================================================== FAIL: Doctest: scikits.learn.decomposition.pca.ProbabilisticPCA ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 2145, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for scikits.learn.decomposition.pca.ProbabilisticPCA File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 273, in ProbabilisticPCA ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 342, in scikits.learn.decomposition.pca.ProbabilisticPCA Failed example: pca.fit(X) Exception raised: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in pca.fit(X) File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 191, in fit self._fit(X, **params) File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 227, in _fit X -= self.mean_ TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 344, in scikits.learn.decomposition.pca.ProbabilisticPCA Failed example: print pca.explained_variance_ratio_ Exception raised: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in print pca.explained_variance_ratio_ AttributeError: 'PCA' object has no attribute 'explained_variance_ratio_' >> raise self.failureException(self.format_failure(.getvalue())) ====================================================================== FAIL: Doctest: scikits.learn.decomposition.pca.RandomizedPCA ---------------------------------------------------------------------- Traceback (most recent call last): File 
"/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 2145, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for scikits.learn.decomposition.pca.RandomizedPCA File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 330, in RandomizedPCA ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 377, in scikits.learn.decomposition.pca.RandomizedPCA Failed example: pca.fit(X) Exception raised: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in pca.fit(X) File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 438, in fit X -= self.mean_ TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line 379, in scikits.learn.decomposition.pca.RandomizedPCA Failed example: print pca.explained_variance_ratio_ Exception raised: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in print pca.explained_variance_ratio_ AttributeError: 'RandomizedPCA' object has no attribute 'explained_variance_ratio_' >> raise self.failureException(self.format_failure(.getvalue())) ====================================================================== FAIL: Doctest: scikits.learn.mixture.GMM ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 2145, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for scikits.learn.mixture.GMM File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line 127, in GMM ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line 226, in scikits.learn.mixture.GMM Failed example: g.fit(20 * [[0]] + 20 * [[10]]) Exception raised: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1241, in __run compileflags, 1) in test.globs File "", line 1, in g.fit(20 * [[0]] + 20 * [[10]]) File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line 480, in fit k=self._n_states).fit(X).cluster_centers_ File "/Users/rgommers/Code/scikit-learn/scikits/learn/cluster/k_means_.py", line 432, in fit tol=self.tol, random_state=self.random_state, copy_x=self.copy_x) File "/Users/rgommers/Code/scikit-learn/scikits/learn/cluster/k_means_.py", line 199, in k_means X -= Xmean TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule 'same_kind' ---------------------------------------------------------------------- File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line 228, in scikits.learn.mixture.GMM Failed example: np.round(g.weights, 2) Expected: array([ 0.5, 0.5]) Got: array([ 0.25, 0.75]) >> 
raise self.failureException(self.format_failure(.getvalue())) -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sun Jun 5 17:15:22 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 5 Jun 2011 23:15:22 +0200 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: <20110605211522.GB25957@phare.normalesup.org> On Sun, Jun 05, 2011 at 10:43:45PM +0200, Ralf Gommers wrote: > four in scikit-learn (plus two that don't > look related), Yeah, some of these failures are due to numerical unstabilities in tests (nasty ones, still fighting) and some simply to crappy code (HMMs are hopeless :( ). Now, with regards to the actual failures induced by the new branch, it took me a while to understand why they where happening, and now I realise that we probably should have explicit coercions at these locations. Ga?l From gmane at blindgoat.org Sun Jun 5 17:46:27 2011 From: gmane at blindgoat.org (martin smith) Date: Sun, 05 Jun 2011 17:46:27 -0400 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows In-Reply-To: References: Message-ID: On 6/3/2011 5:15 AM, Ralf Gommers wrote: > > > On Thu, Jun 2, 2011 at 9:33 PM, Robert Kern > wrote: > > On Thu, Jun 2, 2011 at 14:07, Ralf Gommers > > > wrote: > > > > On Wed, Jun 1, 2011 at 5:45 PM, martin smith > wrote: > >> > >> I have a bit of code that performs multi-taper power spectra > using numpy > >> and a C extension module. The C portion consists of an interface > file > >> and a python-unaware computational file. The latter invokes fftpack. > >> > >> The straightforward setup.py appended below works fine on Linux. On > >> Windows using MinGW it's not so good. I first created a libpython27.a > >> file following instructions on the web. The C code now compiles but > >> fails to link since there is no fftpack_lite.so on Windows. I can't > >> find an fftpack_lite.dll (or even the source for fftpack_lite) > and don't > >> know where to turn. > > > > Python dll's on Windows have a .pyd extension, there should be a > > fftpack_lite.pyd file. But you should be able to just leave off the > > extension, so it does the right thing on any platform. > > > > Also, the way you specified library_dirs doesn't work if numpy is > not in > > site-packages, better to use something like: > > library_dirs = [os.path.join(numpy.__path__[0], 'fft')] > > That said, there is no good cross-platform way to link against other > Python extension modules. Please do not try. You will have to include > a copy of the FFTPACK code in your own extension module. > > Coming back to #608, that means there is no chance that the C version > will land in scipy, correct? We're not going to ship two copies of > FFTPACK. So the answer should be "rewrite in Python, if that's too slow > use Cython". > > Ralf > Thank you both for your prompt response and suggestions. I've modified the setup.py file as Ralf suggested and concocted a version of the code (summarized below) that doesn't try to link to existing modules. The formal problem with a python rewrite is that there is some functionality in the package (the adaptive spectrum estimate) that requires solving an iterative non-linear equation at each frequency. A practical problem is that the odds of my rewriting the core code in python in the near future are not good. 
I've made an include file for the core code from a version of fftpack.c by stripping out the unused functions and making the survivors static. Using this means, of course, that you'd be partly duplicating fftpack from a binary bloat point of view but it leads to a plausible solution, at least. I'm currently having (new) problems compiling on win7 64-bits (msvcr90.dll for 64bits is missing and I haven't had any luck googling it down yet) as well as on winxp 32-bits (my installation of mingw is confused and can't find all of its parts). I'm planning to stick with it a while longer and see if I can get past these bumps. If I succeed I'll reraise the possibility of getting it into scipy. - martin From charlesr.harris at gmail.com Sun Jun 5 18:41:26 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 5 Jun 2011 16:41:26 -0600 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: <20110605211522.GB25957@phare.normalesup.org> References: <20110602200945.GG2554@phare.normalesup.org> <20110605211522.GB25957@phare.normalesup.org> Message-ID: On Sun, Jun 5, 2011 at 3:15 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Sun, Jun 05, 2011 at 10:43:45PM +0200, Ralf Gommers wrote: > > four in scikit-learn (plus two that don't > > look related), > > Yeah, some of these failures are due to numerical unstabilities in tests > (nasty ones, still fighting) and some simply to crappy code (HMMs are > hopeless :( ). > > Now, with regards to the actual failures induced by the new branch, it > took me a while to understand why they where happening, and now I realise > that we probably should have explicit coercions at these locations. > > So the failures in those cases is a good thing? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sun Jun 5 18:51:47 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 6 Jun 2011 00:51:47 +0200 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> <20110605211522.GB25957@phare.normalesup.org> Message-ID: <20110605225147.GD25957@phare.normalesup.org> On Sun, Jun 05, 2011 at 04:41:26PM -0600, Charles R Harris wrote: > Now, with regards to the actual failures induced by the new > branch, it took me a while to understand why they where happening, > and now I realise that we probably should have explicit coercions > at these locations. > So the failures in those cases is a good thing? Yes. Gael From josef.pktd at gmail.com Sun Jun 5 20:06:48 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 5 Jun 2011 20:06:48 -0400 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: <20110605225147.GD25957@phare.normalesup.org> References: <20110602200945.GG2554@phare.normalesup.org> <20110605211522.GB25957@phare.normalesup.org> <20110605225147.GD25957@phare.normalesup.org> Message-ID: On Sun, Jun 5, 2011 at 6:51 PM, Gael Varoquaux wrote: > On Sun, Jun 05, 2011 at 04:41:26PM -0600, Charles R Harris wrote: >> ? ? ?Now, with regards to the actual failures induced by the new >> ? ? ?branch, it took me a while to understand why they where happening, >> ? ? ?and now I realise that we probably should have explicit coercions >> ? ? ?at these locations. > >> ? ?So the failures in those cases is a good thing? > > Yes. I remember in stats distributions, there were casting bugs like this and they are nasty to find. 
assigning a nan to an int has been changed to raise an exception, but inline operation still produces silent results >>> a = np.arange(3, dtype=np.int32) >>> a[0] = np.nan Traceback (most recent call last): File "", line 1, in a[0] = np.nan ValueError: cannot convert float NaN to integer >>> a += np.nan >>> a array([-2147483648, -2147483648, -2147483648]) >>> a = np.arange(3, dtype=np.int32) >>> a *= np.array([1.,1., np.nan]) >>> a array([ 0, 1, -2147483648]) definitely wrong Josef > > Gael > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From ater1980 at gmail.com Mon Jun 6 02:15:51 2011 From: ater1980 at gmail.com (Alex Ter-Sarkissov) Date: Mon, 6 Jun 2011 18:15:51 +1200 Subject: [Numpy-discussion] k maximal elements Message-ID: I have a vector of positive integers length n. Is there a simple (i.e. without sorting/ranking) of 'pulling out' k larrgest (or smallest) values. Something like *sum(x[sum(x,1)>(max(sum(x,1)+min(sum(x,1))))/2,])* but smarter -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdickinson at enthought.com Mon Jun 6 03:16:44 2011 From: mdickinson at enthought.com (Mark Dickinson) Date: Mon, 6 Jun 2011 08:16:44 +0100 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Thu, Jun 2, 2011 at 5:42 PM, Mark Wiebe wrote: > Leap years are easy compared with leap seconds. Leap seconds involve a > hardcoded table of particular leap-seconds that are added or subtracted, and > are specified roughly 6 months in advance of when they happen by > the?International Earth Rotation and Reference Systems Service?(IERS). The > POSIX time_t doesn't store leap seconds, so if you subtract two time_t > values you may get the wrong answer by up to 34 seconds (or 24, I'm not > totally clear on the initial 10 second jump that happened somewhere). Times in the future would be hairy, too. If leap seconds are supported, how many seconds are there in the timedelta: 2015-01-01 00:00:00 - 2010-01-01 00:00:00 ? Is it acceptable for the result of the subtraction like this to change when the leap second table is updated (e.g., to reflect a newly added leap second on June 30th, 2012)? Mark From irwin.zaid at physics.ox.ac.uk Mon Jun 6 05:06:04 2011 From: irwin.zaid at physics.ox.ac.uk (Irwin Zaid) Date: Mon, 6 Jun 2011 10:06:04 +0100 Subject: [Numpy-discussion] Using NumPy iterators in C extension Message-ID: <4DEC987C.6050505@physics.ox.ac.uk> Hi all, I am writing a C extension for NumPy that implements a custom function. I have a minor question about iterators, and their order of iteration, that I would really appreciate some help with. Here goes... My function takes a sequence of N-dimensional input arrays and a single (N+1)-dimensional output array. For each input array, it iterates over the N dimensions of that array and the leading N dimensions of the output array. It stores a result in the trailing dimension of the output array that is cumulative for the entire sequence of input arrays. Crucially, I need to be sure that the input and output elements always match during an iteration, otherwise the output array becomes incorrect. Meaning, is the kth element visited in the output array during a single iteration the same for each input array? 
So far, I have implemented this function using two iterators created with NpyIter_New(), one for an N-dimensional input array and one for the (N+1)-dimensional output array. I call NpyIter_RemoveAxis() on the output array iterator. I am using NPY_KEEPORDER for the NPY_ORDER flag, which I suspect is not what I want. How should I guarantee that I preserve order? Should I be using a different order flag, like NPY_CORDER? Or should I be creating a multi-array iterator with NpyIter_MultiNew for a single input array and the output array, though I don't see how to do that in my case. Thanks in advance for your help! I can provide more information if something is unclear. Best, Irwin From wesmckinn at gmail.com Mon Jun 6 05:43:21 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 6 Jun 2011 10:43:21 +0100 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Mon, Jun 6, 2011 at 8:16 AM, Mark Dickinson wrote: > On Thu, Jun 2, 2011 at 5:42 PM, Mark Wiebe wrote: >> Leap years are easy compared with leap seconds. Leap seconds involve a >> hardcoded table of particular leap-seconds that are added or subtracted, and >> are specified roughly 6 months in advance of when they happen by >> the?International Earth Rotation and Reference Systems Service?(IERS). The >> POSIX time_t doesn't store leap seconds, so if you subtract two time_t >> values you may get the wrong answer by up to 34 seconds (or 24, I'm not >> totally clear on the initial 10 second jump that happened somewhere). > > Times in the future would be hairy, too. If leap seconds are > supported, how many seconds are there in the timedelta: > > ? ?2015-01-01 00:00:00 ?- ?2010-01-01 00:00:00 > > ? ?Is it acceptable for the result of the subtraction like this to > change when the leap second table is updated (e.g., to reflect a newly > added leap second on June 30th, 2012)? > > Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > Tangential to the technical details being discussed, but my $0.02 on datetime generally: One thing I tried to do in pandas is enable the implementation of custom time deltas ("date offsets" in pandas parlance). For example, let's say add 5 weekdays (or business days) to a particular date. Now, there are some arbitrary choices you have to make, e.g. what do you do when the date being added to is not a business day? If you look in https://github.com/wesm/pandas/blob/master/pandas/core/datetools.py you can see some examples. Things are a bit of a mess in places but the key ideas are there. One nice thing about this framework is that you can create subclasses of DateOffset that connect to some external source of information, e.g. trading exchange information-- so if you want to shift by "trading days" for that exchange-- e.g. include holidays and other closures, you can do that relatively seamlessly. Downside is that it's fairly slow generally speaking. I'm hopeful that the datetime64 dtype will enable scikits.timeseries and pandas to consolidate much ofir the datetime / frequency code. scikits.timeseries has a ton of great stuff for generating dates with all the standard fixed frequencies. 
- Wes From shish at keba.be Mon Jun 6 08:52:18 2011 From: shish at keba.be (Olivier Delalleau) Date: Mon, 6 Jun 2011 08:52:18 -0400 Subject: [Numpy-discussion] k maximal elements In-Reply-To: References: Message-ID: I don't really understand your proposed solution, but you can do something like: import heapq q = list(x) heapq.heapify(q) k_smallest = [heapq.heappop(q) for i in xrange(k)] which is in O(n + k log n) -=- Olivier 2011/6/6 Alex Ter-Sarkissov > I have a vector of positive integers length n. Is there a simple (i.e. > without sorting/ranking) of 'pulling out' k larrgest (or smallest) values. > Something like > > *sum(x[sum(x,1)>(max(sum(x,1)+min(sum(x,1))))/2,])* > > but smarter > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cournape at gmail.com Mon Jun 6 09:29:32 2011 From: cournape at gmail.com (David Cournapeau) Date: Mon, 6 Jun 2011 22:29:32 +0900 Subject: [Numpy-discussion] k maximal elements In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 3:15 PM, Alex Ter-Sarkissov wrote: > I have a vector of positive integers length n. Is there a simple (i.e. > without sorting/ranking) of 'pulling out' k larrgest (or smallest) values. Maybe not so simple, but does not require sorting (and its associated o(NlogN) cost): http://en.wikipedia.org/wiki/Selection_algorithm From kwgoodman at gmail.com Mon Jun 6 09:44:38 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Mon, 6 Jun 2011 06:44:38 -0700 Subject: [Numpy-discussion] k maximal elements In-Reply-To: References: Message-ID: On Sun, Jun 5, 2011 at 11:15 PM, Alex Ter-Sarkissov wrote: > I have a vector of positive integers length n. Is there a simple (i.e. > without sorting/ranking) of 'pulling out' k larrgest (or smallest) values. > Something like > > sum(x[sum(x,1)>(max(sum(x,1)+min(sum(x,1))))/2,]) > > but smarter You could use bottleneck [1]. It has a partial sort written in cython: >> a = np.arange(10) >> np.random.shuffle(a) >> a array([6, 3, 8, 5, 4, 2, 0, 1, 9, 7]) >> import bottleneck as bn >> k = 3 >> b = bn.partsort(a, a.size-k) >> b[-k:] array([8, 9, 7]) Or you could get the indices: >> idx = bn.argpartsort(a, a.size-k) >> idx[-k:] array([2, 8, 9]) [1] The partial sort will be in Bottleneck 0.5.0. There is a beta2 at https://github.com/kwgoodman/bottleneck/downloads. Release version will be posted at http://pypi.python.org/pypi/Bottleneck From kwgoodman at gmail.com Mon Jun 6 09:58:11 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Mon, 6 Jun 2011 06:58:11 -0700 Subject: [Numpy-discussion] k maximal elements In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 6:44 AM, Keith Goodman wrote: > On Sun, Jun 5, 2011 at 11:15 PM, Alex Ter-Sarkissov wrote: >> I have a vector of positive integers length n. Is there a simple (i.e. >> without sorting/ranking) of 'pulling out' k larrgest (or smallest) values. >> Something like >> >> sum(x[sum(x,1)>(max(sum(x,1)+min(sum(x,1))))/2,]) >> >> but smarter > > You could use bottleneck [1]. It has a partial sort written in cython: > > ? ?>> a = np.arange(10) > ? ?>> np.random.shuffle(a) > ? ?>> a > ? ? ? array([6, 3, 8, 5, 4, 2, 0, 1, 9, 7]) > ? ?>> import bottleneck as bn > ? ?>> k = 3 > ? ?>> b = bn.partsort(a, a.size-k) > ? ?>> b[-k:] > ? ? ? array([8, 9, 7]) > > Or you could get the indices: > > ? ?>> idx = bn.argpartsort(a, a.size-k) > ? 
?>> idx[-k:] > ? ? ? array([2, 8, 9]) > > [1] The partial sort will be in Bottleneck 0.5.0. There is a beta2 at > https://github.com/kwgoodman/bottleneck/downloads. Release version > will be posted at http://pypi.python.org/pypi/Bottleneck Oh, and here are some timings: >> a = np.random.rand(1e6) >> timeit a.sort() 10 loops, best of 3: 18.3 ms per loop >> timeit bn.partsort(a, 100) 100 loops, best of 3: 3.71 ms per loop >> >> a = np.random.rand(1e4) >> timeit a.sort() 1000 loops, best of 3: 170 us per loop >> timeit bn.partsort(a, 10) 10000 loops, best of 3: 19.4 us per loop From gruben at bigpond.net.au Mon Jun 6 09:57:55 2011 From: gruben at bigpond.net.au (gary ruben) Date: Mon, 6 Jun 2011 23:57:55 +1000 Subject: [Numpy-discussion] k maximal elements In-Reply-To: References: Message-ID: I learn a lot by watching the numpy and scipy lists (today Olivier taught me about heapq :), but he may not have noticed that Python 2.4 added an nsmallest method) import heapq q = list(x) heapq.heapify(q) k_smallest = heapq.nsmallest(k,q) On Mon, Jun 6, 2011 at 10:52 PM, Olivier Delalleau wrote: > I don't really understand your proposed solution, but you can do something > like: > > import heapq > q = list(x) > heapq.heapify(q) > k_smallest = [heapq.heappop(q) for i in xrange(k)] > > which is in O(n + k log n) > > -=- Olivier > > 2011/6/6 Alex Ter-Sarkissov >> >> I have a vector of positive integers length n. Is there a simple (i.e. >> without sorting/ranking) of 'pulling out' k larrgest (or smallest) values. >> Something like >> >> sum(x[sum(x,1)>(max(sum(x,1)+min(sum(x,1))))/2,]) >> >> but smarter >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From kwgoodman at gmail.com Mon Jun 6 10:06:51 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Mon, 6 Jun 2011 07:06:51 -0700 Subject: [Numpy-discussion] k maximal elements In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 6:57 AM, gary ruben wrote: > I learn a lot by watching the numpy and scipy lists (today Olivier > taught me about heapq :), but he may not have noticed that Python 2.4 > added an nsmallest method) > > import heapq > q = list(x) > heapq.heapify(q) > k_smallest = heapq.nsmallest(k,q) I learned something new too---heapq. Timings: >> alist = a.tolist() >> heapq.heapify(alist) >> timeit k_smallest = heapq.nsmallest(k, alist) 100 loops, best of 3: 1.84 ms per loop >> timeit bn.partsort(a, a.size-10) 10000 loops, best of 3: 39.3 us per loop From mwwiebe at gmail.com Mon Jun 6 10:41:05 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 09:41:05 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Mon, Jun 6, 2011 at 2:16 AM, Mark Dickinson wrote: > On Thu, Jun 2, 2011 at 5:42 PM, Mark Wiebe wrote: > > Leap years are easy compared with leap seconds. Leap seconds involve a > > hardcoded table of particular leap-seconds that are added or subtracted, > and > > are specified roughly 6 months in advance of when they happen by > > the International Earth Rotation and Reference Systems Service (IERS). 
> The > > POSIX time_t doesn't store leap seconds, so if you subtract two time_t > > values you may get the wrong answer by up to 34 seconds (or 24, I'm not > > totally clear on the initial 10 second jump that happened somewhere). > > Times in the future would be hairy, too. If leap seconds are > supported, how many seconds are there in the timedelta: > > 2015-01-01 00:00:00 - 2010-01-01 00:00:00 > > ? Is it acceptable for the result of the subtraction like this to > change when the leap second table is updated (e.g., to reflect a newly > added leap second on June 30th, 2012)? This is an inherent problem with UTC, and because of choosing UTC we have a choice of making time itself nonlinear (the current implementation) or making minutes, hours, days, and weeks nonlinear. With the current approach, the answer to the above will probably always be wrong (unless leap-seconds are eliminated before adding another), and with the latter approach, the answer to the above could be wrong until the leap-seconds are decided in the interval. I'm leaning towards a linear time representation, and nonlinear minutes, hours, etc, with possibly linear versions as well. These units could be split into 'm:UTC' and 'm:linear' for UTC and linear minutes, for example. We could support an unambiguous version of future times by extending the ISO 8601 format with a "TAI" and/or "GPS" instead of "Z" at the end, -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 6 10:45:51 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 09:45:51 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: On Mon, Jun 6, 2011 at 4:43 AM, Wes McKinney wrote: > On Mon, Jun 6, 2011 at 8:16 AM, Mark Dickinson > wrote: > > On Thu, Jun 2, 2011 at 5:42 PM, Mark Wiebe wrote: > >> Leap years are easy compared with leap seconds. Leap seconds involve a > >> hardcoded table of particular leap-seconds that are added or subtracted, > and > >> are specified roughly 6 months in advance of when they happen by > >> the International Earth Rotation and Reference Systems Service (IERS). > The > >> POSIX time_t doesn't store leap seconds, so if you subtract two time_t > >> values you may get the wrong answer by up to 34 seconds (or 24, I'm not > >> totally clear on the initial 10 second jump that happened somewhere). > > > > Times in the future would be hairy, too. If leap seconds are > > supported, how many seconds are there in the timedelta: > > > > 2015-01-01 00:00:00 - 2010-01-01 00:00:00 > > > > ? Is it acceptable for the result of the subtraction like this to > > change when the leap second table is updated (e.g., to reflect a newly > > added leap second on June 30th, 2012)? > > > > Mark > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > Tangential to the technical details being discussed, but my $0.02 on > datetime generally: > > One thing I tried to do in pandas is enable the implementation of > custom time deltas ("date offsets" in pandas parlance). For example, > let's say add 5 weekdays (or business days) to a particular date. Now, > there are some arbitrary choices you have to make, e.g. what do you do > when the date being added to is not a business day? 
If you look in > > https://github.com/wesm/pandas/blob/master/pandas/core/datetools.py > > you can see some examples. Things are a bit of a mess in places but > the key ideas are there. One nice thing about this framework is that > you can create subclasses of DateOffset that connect to some external > source of information, e.g. trading exchange information-- so if you > want to shift by "trading days" for that exchange-- e.g. include > holidays and other closures, you can do that relatively seamlessly. > Downside is that it's fairly slow generally speaking. > > I'm hopeful that the datetime64 dtype will enable scikits.timeseries > and pandas to consolidate much ofir the datetime / frequency code. > scikits.timeseries has a ton of great stuff for generating dates with > all the standard fixed frequencies. I'm wondering if removing the business-day unit from datetime64, and adding a business-day API would be a good approach to support all the things that are needed? Any time you add regular days to a business date, you need to resolve the result to NaT or the closest business day earlier or later, so having that always go through an "add_to_business_date" function with a variety of keyword parameters, and various other API functions, might be better in general. -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 6 10:58:20 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 09:58:20 -0500 Subject: [Numpy-discussion] _sort module. In-Reply-To: References: Message-ID: On Sat, Jun 4, 2011 at 1:34 PM, Charles R Harris wrote: > Hi All, > > Numpy has a _sort module which contains *no* methods and whose only purpose > is to modify the type descriptors by adding pointers to the sorting > functions when it is loaded. Consequently _sort is imported early in > numpy/core/__init__.py and is otherwise unused. This really doesn't seem > right. I would like to make the sorting functions into a library or at least > link them directly as part of the build process. I'd also like to move them > into their own directory and break them up into several files in the process > of adding more functions for order statistics. > > Thoughts? > Sounds good to me, I think this belongs in core, and core itself should be split up into subdirectories like you're describing. -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 6 11:10:32 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 10:10:32 -0500 Subject: [Numpy-discussion] Using NumPy iterators in C extension In-Reply-To: <4DEC987C.6050505@physics.ox.ac.uk> References: <4DEC987C.6050505@physics.ox.ac.uk> Message-ID: On Mon, Jun 6, 2011 at 4:06 AM, Irwin Zaid wrote: > Hi all, > > I am writing a C extension for NumPy that implements a custom function. > I have a minor question about iterators, and their order of iteration, > that I would really appreciate some help with. Here goes... > > My function takes a sequence of N-dimensional input arrays and a single > (N+1)-dimensional output array. For each input array, it iterates over > the N dimensions of that array and the leading N dimensions of the > output array. It stores a result in the trailing dimension of the output > array that is cumulative for the entire sequence of input arrays. > > Crucially, I need to be sure that the input and output elements always > match during an iteration, otherwise the output array becomes incorrect. 
> Meaning, is the kth element visited in the output array during a single > iteration the same for each input array? > > So far, I have implemented this function using two iterators created > with NpyIter_New(), one for an N-dimensional input array and one for the > (N+1)-dimensional output array. I call NpyIter_RemoveAxis() on the > output array iterator. I am using NPY_KEEPORDER for the NPY_ORDER flag, > which I suspect is not what I want. > > How should I guarantee that I preserve order? Should I be using a > different order flag, like NPY_CORDER? Or should I be creating a > multi-array iterator with NpyIter_MultiNew for a single input array and > the output array, though I don't see how to do that in my case. > Using a different order flag would make things consistent, but the better solution is to use NpyIter_AdvancedNew, and provide 'op_axes' to describe how your two arrays relate. Here's the Python equivalent of what I believe you want to do, in this case allowing the iterator to allocate the output for you: >>> import numpy as np >>> >>> a = np.zeros((2,3,4)) >>> >>> it = np.nditer([a,None], ... flags=['multi_index'], ... op_flags=[['readonly'],['writeonly','allocate']], ... op_axes=[range(a.ndim) + [-1], None], ... itershape=(a.shape+(5,))) >>> it.remove_axis(it.ndim-1) >>> it.remove_multi_index() >>> >>> b = it.operands[1] >>> >>> print a.shape, b.shape, it.shape (2, 3, 4) (2, 3, 4, 5) (24,) Cheers, Mark > Thanks in advance for your help! I can provide more information if > something is unclear. > > Best, > > Irwin > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 6 11:30:01 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 10:30:01 -0500 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Sun, Jun 5, 2011 at 3:43 PM, Ralf Gommers wrote: > On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: > >> On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < >> gael.varoquaux at normalesup.org> wrote: >> >>> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >>> > Would anyone object to, at least temporarily, tightening up the >>> default >>> > ufunc casting rule to 'same_kind' in NumPy master? It's a one line >>> change, >>> > so would be easy to undo, but such a change is very desirable in my >>> > opinion. >>> > This would raise an exception, since it's np.add(a, 1.9, out=a), >>> > converting a float to an int: >>> >>> > >>> a = np.arange(3, dtype=np.int32) >>> >>> > >>> a += 1.9 >>> >>> That's probably going to break a huge amount of code which relies on the >>> current behavior. >>> >>> Am I right in believing that this should only be considered for a major >>> release of numpy, say numpy 2.0? >> >> >> Absolutely, and that's why I'm proposing to do it in master now, fairly >> early in a development cycle, so we can evaluate its effects. If the next >> version is 1.7, we probably would roll it back for release (a 1 line >> change), and if the next version is 2.0, we probably would keep it in. >> >> I suspect at least some of the code relying on the current behavior may >> have bugs, and tightening this up is a way to reveal them. 
>> >> > Here are some results of testing your tighten_casting branch on a few > projects - no need to first put it in master first to do that. Four failures > in numpy, two in scipy, four in scikit-learn (plus two that don't look > related), none in scikits.statsmodels. I didn't check how many of them are > actual bugs. > > I'm not against trying out your change, but it would probably be good to do > some more testing first and fix the issues found before putting it in. Then > at least if people run into issues with the already tested packages, you can > just tell them to update those to latest master. > Cool, thanks for running those. I already took a chunk out of the NumPy failures. The ones_like function shouldn't really be a ufunc, but rather be like zeros_like and empty_like, but that's probably not something to change right now. The datetime-fixes type resolution change provides a mechanism to fix that up pretty easily. For Scipy, what do you think is the best way to resolve it? If NumPy 1.6 is the minimum version for the next scipy, I would add casting='unsafe' to the failing sqrt call. -Mark > Cheers, > Ralf > > > NUMPY > > ====================================================================== > ERROR: test_ones_like (test_numeric.TestLikeFuncs) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/Users/rgommers/Code/numpy/numpy/core/tests/test_numeric.py", line > 1257, in test_ones_like > self.check_like_function(np.ones_like, 1) > File "/Users/rgommers/Code/numpy/numpy/core/tests/test_numeric.py", line > 1200, in check_like_function > dz = like_function(d, dtype=dtype) > TypeError: found a loop for ufunc 'ones_like' matching the type-tuple, but > the inputs and/or outputs could not be cast according to the casting rule > > ====================================================================== > ERROR: test_methods_with_output (test_core.TestMaskedArrayArithmetic) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/Users/rgommers/Code/numpy/numpy/ma/tests/test_core.py", line 1109, > in test_methods_with_output > result = xmmeth(axis=0, out=output) > File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 4774, in std > out **= 0.5 > File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 3765, in > __ipow__ > ndarray.__ipow__(self._data, np.where(self._mask, 1, other_data)) > TypeError: ufunc 'power' output (typecode 'd') could not be coerced to > provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > > ====================================================================== > ERROR: Check that we don't shrink a mask when not wanted > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/Users/rgommers/Code/numpy/numpy/ma/tests/test_core.py", line 1044, > in test_noshrinking > a /= 1. 
> File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 3725, in > __idiv__ > ndarray.__idiv__(self._data, np.where(self._mask, 1, other_data)) > TypeError: ufunc 'divide' output (typecode 'd') could not be coerced to > provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > > ====================================================================== > ERROR: Test of inplace additions > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/Users/rgommers/Code/numpy/numpy/ma/tests/test_core.py", line 1651, > in test_inplace_addition_array > x += a > File "/Users/rgommers/Code/numpy/numpy/ma/core.py", line 3686, in > __iadd__ > ndarray.__iadd__(self._data, np.where(self._mask, 0, getdata(other))) > TypeError: ufunc 'add' output (typecode 'd') could not be coerced to > provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > > > > SCIPY > > ====================================================================== > ERROR: gaussian gradient magnitude filter 1 > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", > line 187, in runTest > self.test(*self.arg) > File "/Users/rgommers/Code/scipy/scipy/ndimage/tests/test_ndimage.py", > line 670, in test_gaussian_gradient_magnitude01 > 1.0) > File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 479, in > gaussian_gradient_magnitude > cval, extra_arguments = (sigma,)) > File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 450, in > generic_gradient_magnitude > numpy.sqrt(output, output) > TypeError: ufunc 'sqrt' output (typecode 'd') could not be coerced to > provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > > ====================================================================== > ERROR: gaussian gradient magnitude filter 2 > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", > line 187, in runTest > self.test(*self.arg) > File "/Users/rgommers/Code/scipy/scipy/ndimage/tests/test_ndimage.py", > line 685, in test_gaussian_gradient_magnitude02 > output) > File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 479, in > gaussian_gradient_magnitude > cval, extra_arguments = (sigma,)) > File "/Users/rgommers/Code/scipy/scipy/ndimage/filters.py", line 450, in > generic_gradient_magnitude > numpy.sqrt(output, output) > TypeError: ufunc 'sqrt' output (typecode 'd') could not be coerced to > provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > > > > SCIKIT-LEARN > > ====================================================================== > ERROR: Test decision_function > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", > line 187, in runTest > self.test(*self.arg) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/tests/test_svm.py", > line 254, in test_decision_function > assert_array_almost_equal(-dec, np.ravel(clf.decision_function(data))) > File 
"/Users/rgommers/Code/scikit-learn/scikits/learn/svm/base.py", line > 276, in decision_function > **self._get_params()) > File "libsvm.pyx", line 385, in > scikits.learn.svm.libsvm.decision_function (scikits/learn/svm/libsvm.c:4748) > ValueError: ndarray is not C-contiguous > > ====================================================================== > ERROR: Test weights on individual samples > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.0.0-py2.6.egg/nose/case.py", > line 187, in runTest > self.test(*self.arg) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/tests/test_svm.py", > line 282, in test_sample_weights > assert_array_equal(clf.predict(X[2]), [1.]) > File "/Users/rgommers/Code/scikit-learn/scikits/learn/svm/base.py", line > 186, in predict > svm_type=svm_type, **self._get_params()) > File "libsvm.pyx", line 224, in scikits.learn.svm.libsvm.predict > (scikits/learn/svm/libsvm.c:3133) > ValueError: ndarray is not C-contiguous > > ====================================================================== > FAIL: Doctest: scikits.learn.decomposition.pca.PCA > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 2145, in runTest > raise self.failureException(self.format_failure(new.getvalue())) > AssertionError: Failed doctest test for scikits.learn.decomposition.pca.PCA > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 94, in PCA > > ---------------------------------------------------------------------- > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 161, in scikits.learn.decomposition.pca.PCA > Failed example: > pca.fit(X) > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", line 1, in > > pca.fit(X) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 191, in fit > self._fit(X, **params) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 227, in _fit > X -= self.mean_ > TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced > to provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > ---------------------------------------------------------------------- > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 163, in scikits.learn.decomposition.pca.PCA > Failed example: > print pca.explained_variance_ratio_ > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", line 1, in > > print pca.explained_variance_ratio_ > AttributeError: 'PCA' object has no attribute > 'explained_variance_ratio_' > > >> raise self.failureException(self.format_failure( instance at 0xfe05648>.getvalue())) > > > ====================================================================== > FAIL: Doctest: scikits.learn.decomposition.pca.ProbabilisticPCA > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > 
"/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 2145, in runTest > raise self.failureException(self.format_failure(new.getvalue())) > AssertionError: Failed doctest test for > scikits.learn.decomposition.pca.ProbabilisticPCA > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 273, in ProbabilisticPCA > > ---------------------------------------------------------------------- > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 342, in scikits.learn.decomposition.pca.ProbabilisticPCA > Failed example: > pca.fit(X) > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", > line 1, in > pca.fit(X) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 191, in fit > self._fit(X, **params) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 227, in _fit > X -= self.mean_ > TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced > to provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > ---------------------------------------------------------------------- > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 344, in scikits.learn.decomposition.pca.ProbabilisticPCA > Failed example: > print pca.explained_variance_ratio_ > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", > line 1, in > print pca.explained_variance_ratio_ > AttributeError: 'PCA' object has no attribute > 'explained_variance_ratio_' > > >> raise self.failureException(self.format_failure( instance at 0xfd8d1c0>.getvalue())) > > > ====================================================================== > FAIL: Doctest: scikits.learn.decomposition.pca.RandomizedPCA > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 2145, in runTest > raise self.failureException(self.format_failure(new.getvalue())) > AssertionError: Failed doctest test for > scikits.learn.decomposition.pca.RandomizedPCA > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 330, in RandomizedPCA > > ---------------------------------------------------------------------- > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 377, in scikits.learn.decomposition.pca.RandomizedPCA > Failed example: > pca.fit(X) > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", > line 1, in > pca.fit(X) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 438, in fit > X -= self.mean_ > TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced > to provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > ---------------------------------------------------------------------- > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/decomposition/pca.py", line > 379, in 
scikits.learn.decomposition.pca.RandomizedPCA > Failed example: > print pca.explained_variance_ratio_ > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", > line 1, in > print pca.explained_variance_ratio_ > AttributeError: 'RandomizedPCA' object has no attribute > 'explained_variance_ratio_' > > >> raise self.failureException(self.format_failure( instance at 0xfe05670>.getvalue())) > > > ====================================================================== > FAIL: Doctest: scikits.learn.mixture.GMM > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 2145, in runTest > raise self.failureException(self.format_failure(new.getvalue())) > AssertionError: Failed doctest test for scikits.learn.mixture.GMM > File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line > 127, in GMM > > ---------------------------------------------------------------------- > File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line > 226, in scikits.learn.mixture.GMM > Failed example: > g.fit(20 * [[0]] + 20 * [[10]]) > Exception raised: > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1241, in __run > compileflags, 1) in test.globs > File "", line 1, in > g.fit(20 * [[0]] + 20 * [[10]]) > File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", > line 480, in fit > k=self._n_states).fit(X).cluster_centers_ > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/cluster/k_means_.py", line > 432, in fit > tol=self.tol, random_state=self.random_state, copy_x=self.copy_x) > File > "/Users/rgommers/Code/scikit-learn/scikits/learn/cluster/k_means_.py", line > 199, in k_means > X -= Xmean > TypeError: ufunc 'subtract' output (typecode 'd') could not be coerced > to provided output parameter (typecode 'l') according to the casting rule > 'same_kind' > ---------------------------------------------------------------------- > File "/Users/rgommers/Code/scikit-learn/scikits/learn/mixture.py", line > 228, in scikits.learn.mixture.GMM > Failed example: > np.round(g.weights, 2) > Expected: > array([ 0.5, 0.5]) > Got: > array([ 0.25, 0.75]) > > >> raise self.failureException(self.format_failure( instance at 0xfde5238>.getvalue())) > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RadimRehurek at seznam.cz Mon Jun 6 11:44:30 2011 From: RadimRehurek at seznam.cz (RadimRehurek) Date: Mon, 06 Jun 2011 17:44:30 +0200 (CEST) Subject: [Numpy-discussion] k maximal elements In-Reply-To: Message-ID: <26563.7260.17337-18986-1157508412-1307375070@seznam.cz> > On Mon, Jun 6, 2011 at 6:57 AM, gary ruben wrote: > > I learn a lot by watching the numpy and scipy lists (today Olivier > > taught me about heapq :), but he may not have noticed that Python 2.4 > > added an nsmallest method) I needed indices of the selected elements as well (not just the k-smallest values themselves) some time ago and compared heapq vs. sort vs. numpy.argsort. 
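For concreteness, the argsort-based variant of that comparison looks
roughly like this (a sketch only: it assumes a 1-D array and a k that is
small relative to its size, and the Bottleneck call assumes your version
provides argpartsort):

    import numpy as np
    import bottleneck as bn   # only needed for the partial-sort variant

    a = np.random.rand(10000)
    k = 10

    # full sort of the indices: O(n log n), but gives ordered indices
    # and values directly
    idx = np.argsort(a)[:k]
    k_smallest = a[idx]

    # partial sort: the first k positions hold the indices of the k
    # smallest values, in no particular order
    idx_fast = bn.argpartsort(a, k)[:k]
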
Timing results (for my specific usecase) are at https://github.com/piskvorky/gensim/issues/5 . The results are as expected: argsort is a must, forming (value, index) tuples explicitly and then sorting that with sort/heapq kills performance. Best, Radim From mwwiebe at gmail.com Mon Jun 6 12:56:37 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 11:56:37 -0500 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Mon, Jun 6, 2011 at 10:30 AM, Mark Wiebe wrote: > On Sun, Jun 5, 2011 at 3:43 PM, Ralf Gommers wrote: > >> On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: >> >>> On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < >>> gael.varoquaux at normalesup.org> wrote: >>> >>>> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >>>> > Would anyone object to, at least temporarily, tightening up the >>>> default >>>> > ufunc casting rule to 'same_kind' in NumPy master? It's a one line >>>> change, >>>> > so would be easy to undo, but such a change is very desirable in my >>>> > opinion. >>>> > This would raise an exception, since it's np.add(a, 1.9, out=a), >>>> > converting a float to an int: >>>> >>>> > >>> a = np.arange(3, dtype=np.int32) >>>> >>>> > >>> a += 1.9 >>>> >>>> That's probably going to break a huge amount of code which relies on the >>>> current behavior. >>>> >>>> Am I right in believing that this should only be considered for a major >>>> release of numpy, say numpy 2.0? >>> >>> >>> Absolutely, and that's why I'm proposing to do it in master now, fairly >>> early in a development cycle, so we can evaluate its effects. If the next >>> version is 1.7, we probably would roll it back for release (a 1 line >>> change), and if the next version is 2.0, we probably would keep it in. >>> >>> I suspect at least some of the code relying on the current behavior may >>> have bugs, and tightening this up is a way to reveal them. >>> >>> >> Here are some results of testing your tighten_casting branch on a few >> projects - no need to first put it in master first to do that. Four failures >> in numpy, two in scipy, four in scikit-learn (plus two that don't look >> related), none in scikits.statsmodels. I didn't check how many of them are >> actual bugs. >> >> I'm not against trying out your change, but it would probably be good to >> do some more testing first and fix the issues found before putting it in. >> Then at least if people run into issues with the already tested packages, >> you can just tell them to update those to latest master. >> > > Cool, thanks for running those. I already took a chunk out of the NumPy > failures. The ones_like function shouldn't really be a ufunc, but rather be > like zeros_like and empty_like, but that's probably not something to change > right now. The datetime-fixes type resolution change provides a mechanism to > fix that up pretty easily. > > For Scipy, what do you think is the best way to resolve it? If NumPy 1.6 is > the minimum version for the next scipy, I would add casting='unsafe' to the > failing sqrt call. > I've updated the tighten_casting branch so it now passes all tests. For masked arrays, this required changing some tests to not assume float -> int casts are fine by default, but otherwise I fixed things by relaxing the rules just where necessary. It now depends on the datetime-fixes branch, which I would like to merge at its current point. -Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Chris.Barker at noaa.gov Mon Jun 6 14:33:42 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Mon, 06 Jun 2011 11:33:42 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> Message-ID: <4DED1D86.5070304@noaa.gov> Mark Wiebe wrote: > I'm wondering if removing the business-day unit from datetime64, and > adding a business-day API would be a good approach to support all the > things that are needed? That sounds like a good idea to me -- and perhaps it could be a general Calendar Functions API, to handle other issues as well. Looking again at the NEP, I see "business day" as a time unit, then below, that "B" is interpreted as "24h, 1440m, 86400s" i.e. a day. This seems like a really bad idea -- it implies that you can convert from business day to hours, and back again, but, of course, if you do: start_date + n business_days You should not get the same thing as: start_date + (n * 24) hrs which is why I think that a "business day" is not really a unit -- it is a specification for a Calendar operation. My thought is that a datetime should not be able to be expressed in units like business days. A timedelta should, but then there needs to be a Calendar(i.e. a set of rules) associated with it, and it should be clearly distinct from "linear unit" timedeltas. Business days aside, I also see that months and years are given "Interpreted as" definitions: Code Interpreted as Y 12M, 52W, 365D M 4W, 30D, 720h This is even self inconsistent: 1Y == 365D 1Y == 12M == 12 * 30D == 360D 1Y == 12M == 12 * 4W == 12 * 4 * 7D == 336D 1Y == 52W == 52 * 7D == 364D Is it not clear from this what a mess of mis-interpretation might result from all that? In thinking more, numpy's needs are a little different that the netcdf standards -- netcdf is a way to transmit and store information -- some of the key problems that come up is that people develop tools to do things with the data that make assumptions. For instance, my tools that work with CF-compliant netcdf pretty much always convert the "time" axis (expresses as time_units since a_datetime) into python datetime objects, and then go from there. This goes to heck is the data is expressed in something like "months since 1995-01-01" Because months are only defined on a Calendar. Anyway, this kind of thing may be less of an issue because we use numpy to write the tools, not to store and share the data, so hopefully, the tool author knows what she/he is working with. But I still think the distinction between linear, convertible, time units, and time units that vary depending on where you are on which calendar, and what calendar you are using, should be kept clearly distinct. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From akshar.bhosale at gmail.com Mon Jun 6 14:55:35 2011 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Tue, 7 Jun 2011 00:25:35 +0530 Subject: [Numpy-discussion] error when numpy is imported In-Reply-To: References: Message-ID: thanks for pointer..it helped me a lot. -akshar On Sat, Jun 4, 2011 at 9:39 PM, Ralf Gommers wrote: > > > On Fri, Jun 3, 2011 at 11:05 PM, akshar bhosale wrote: > >> [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2 >> Type "help", "copyright", "credits" or "license" for more information. 
>> >>> import numpy >> Traceback (most recent call last): >> File "", line 1, in >> File "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/__init__.py", >> line 137, in >> import add_newdocs >> File >> "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/add_newdocs.py", line 9, >> in >> from numpy.lib import add_newdoc >> File >> "/homeadmin/numpy-1.6.0b2/install/lib/python/numpy/lib/__init__.py", line >> 13, in >> from polynomial import * >> File >> "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/lib/polynomial.py", line >> 17, in >> from numpy.linalg import eigvals, lstsq >> File >> "/home/admin/numpy-1.6.0b2/install/lib/python/numpy/linalg/__init__.py", >> line 48, in >> from linalg import * >> File >> "/home/dmin/numpy-1.6.0b2/install/lib/python/numpy/linalg/linalg.py", line >> 23, in >> from numpy.linalg import lapack_lite >> ImportError: >> /home/external/Python-2.6/lib/python2.6/site-packages/numpy/linalg/lapack_lite.so: >> undefined symbol: _gfortran_concat_string >> >> We have intel mkl with which we are trying to build numpy and scipy. >> >> It looks like you are mixing different fortran compilers, which doesn't > work. > > Ralf > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 6 15:31:07 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 14:31:07 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DED1D86.5070304@noaa.gov> References: <4DE7B8C0.4040805@noaa.gov> <4DED1D86.5070304@noaa.gov> Message-ID: On Mon, Jun 6, 2011 at 1:33 PM, Christopher Barker wrote: > Mark Wiebe wrote: > > I'm wondering if removing the business-day unit from datetime64, and > > adding a business-day API would be a good approach to support all the > > things that are needed? > > That sounds like a good idea to me -- and perhaps it could be a general > Calendar Functions API, to handle other issues as well. > > > > My thought is that a datetime should not be able to be expressed in > units like business days. A timedelta should, but then there needs to be > a Calendar(i.e. a set of rules) associated with it, and it should be > clearly distinct from "linear unit" timedeltas. > I think business day should be removed from timedelta as well, and all this should be functions operating on the days unit. Having the definition of a work/trading day be a function parameter instead of a part of a unit metadata makes this more explicit and eliminates a lot of potential clutter in the implementation, which has a lot to deal with at that level already. > Business days aside, I also see that months and years are given > "Interpreted as" definitions: > > Code Interpreted as > Y 12M, 52W, 365D > M 4W, 30D, 720h > > This is even self inconsistent: > > 1Y == 365D > > 1Y == 12M == 12 * 30D == 360D > > 1Y == 12M == 12 * 4W == 12 * 4 * 7D == 336D > > 1Y == 52W == 52 * 7D == 364D > > Is it not clear from this what a mess of mis-interpretation might result > from all that? > This part of the code is used for mapping metadata like [Y/4] -> [3M], or [Y/26] -> [2W]. I agree that this '/' operator in the unit metadata is weird, and wouldn't object to removing it. 
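As a rough sketch of the function-based business-day handling suggested a
few paragraphs up (rather than encoding it in unit metadata like the '/'
divisor just discussed) -- the name, keyword arguments and rolling rule
below are invented for illustration, not an existing NumPy API:

    import datetime

    def roll_to_busday(date, holidays=(), weekmask=(0, 1, 2, 3, 4), forward=True):
        # Roll 'date' onto the nearest valid business day. 'weekmask'
        # lists the weekday() values that count as business days
        # (Mon=0 ... Sun=6); 'holidays' is any container of
        # datetime.date objects.
        step = datetime.timedelta(days=1 if forward else -1)
        while date.weekday() not in weekmask or date in holidays:
            date += step
        return date

    roll_to_busday(datetime.date(2011, 6, 11))   # Saturday -> Monday 2011-06-13

The point is that the work/trading-day definition travels with the
function call rather than with the unit metadata.
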
> In thinking more, numpy's needs are a little different that the netcdf > standards -- netcdf is a way to transmit and store information -- some > of the key problems that come up is that people develop tools to do > things with the data that make assumptions. For instance, my tools that > work with CF-compliant netcdf pretty much always convert the "time" axis > (expresses as time_units since a_datetime) into python datetime objects, > and then go from there. > > This goes to heck is the data is expressed in something like "months > since 1995-01-01" > > Because months are only defined on a Calendar. > Here's what the current implementation can do with that one: >>> np.datetime64('1995-01-01', 'M') + 13 numpy.datetime64('1996-02','M') > > Anyway, this kind of thing may be less of an issue because we use numpy > to write the tools, not to store and share the data, so hopefully, the > tool author knows what she/he is working with. But I still think the > distinction between linear, convertible, time units, and time units that > vary depending on where you are on which calendar, and what calendar you > are using, should be kept clearly distinct. > Yeah, I'm still working out something that strikes a balance for all these concepts. -Mark > > > -Chris > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Mon Jun 6 15:32:09 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Mon, 6 Jun 2011 21:32:09 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DED1D86.5070304@noaa.gov> References: <4DE7B8C0.4040805@noaa.gov> <4DED1D86.5070304@noaa.gov> Message-ID: On Jun 6, 2011, at 8:33 PM, Christopher Barker wrote: > Mark Wiebe wrote: >> I'm wondering if removing the business-day unit from datetime64, and >> adding a business-day API would be a good approach to support all the >> things that are needed? > > That sounds like a good idea to me -- and perhaps it could be a general > Calendar Functions API, to handle other issues as well. +1 as well. > Looking again at the NEP, I see "business day" as a time unit, then > below, that "B" is interpreted as "24h, 1440m, 86400s" i.e. a day. Well, "B" was initially interpreted as Mon-Fri, which works well enough as long as you don't take holidays into account (that's how we did it in scikits.timeseries. > > Business days aside, I also see that months and years are given > "Interpreted as" definitions: > > Code Interpreted as > Y 12M, 52W, 365D > M 4W, 30D, 720h > > This is even self inconsistent: We had some neat conversion rules in scikits.timeseries, supporting years and months. It's not as messy as you want to convince yourself, don't worry ;) There was one issue, though. When converting to a higher frequency, we had to decide whether to put the new point at the beginning or the end of the period. That was managed through a keyword (see http://pytseries.sourceforge.net/generated/scikits.timeseries.TimeSeries.convert.html#scikits.timeseries.TimeSeries.convert for more details). But apart from that, it can be handled quite neatly. 
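To make the START/END point concrete, here is roughly how it looks at the
Date level in scikits.timeseries (a sketch from memory -- exact reprs and
keyword defaults may differ between versions; the series-level equivalent
is TimeSeries.convert() with the keyword linked above):

    import scikits.timeseries as ts

    m = ts.Date('M', '01-Jun-2011')
    m.asfreq('D', relation='START')   # -> 01-Jun-2011
    m.asfreq('D', relation='END')     # -> 30-Jun-2011
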
> > This goes to heck is the data is expressed in something like "months > since 1995-01-01" > > Because months are only defined on a Calendar. Using the ISO as reference, you have a good definition of months. From mwwiebe at gmail.com Mon Jun 6 15:46:25 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 14:46:25 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> <4DED1D86.5070304@noaa.gov> Message-ID: On Mon, Jun 6, 2011 at 2:31 PM, Mark Wiebe wrote: > On Mon, Jun 6, 2011 at 1:33 PM, Christopher Barker wrote: > >> >> > This is even self inconsistent: >> >> 1Y == 365D >> >> 1Y == 12M == 12 * 30D == 360D >> >> 1Y == 12M == 12 * 4W == 12 * 4 * 7D == 336D >> >> 1Y == 52W == 52 * 7D == 364D >> >> Is it not clear from this what a mess of mis-interpretation might result >> from all that? >> > > This part of the code is used for mapping metadata like [Y/4] -> [3M], or > [Y/26] -> [2W]. I agree that this '/' operator in the unit metadata is > weird, and wouldn't object to removing it. > I'd be happy to remove the multiplier before the unit name too. The NEP gives a rationale mentioning C#'s tick unit being 100ns, but I don't see that or the other reasons providing a good explanation for it. -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 6 16:24:17 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 15:24:17 -0500 Subject: [Numpy-discussion] dealing with datetime UTC vs linear time Message-ID: I've given the UTC leap-second issue a bit of thought, and here are some related requirements I've come up with: 1. The underlying data representation of datetime64 should be linearly related to time. 2. Default calendar functionality should be based on ISO 8601, and by reference be in UTC 3. It should be possible to unambiguously specify a time at any point in the past or the future, and get such a representation back from NumPy. One problem trying to resolve these requirements with respect to UTC is that either the underlying time has jump discontinuities, like in POSIX time_t, or minutes, hours, days, and weeks are sometimes one second longer or shorter. Additionally, it's not possible in general to specify UTC times with second resolution 6 or more months in advance. Here's what I propose NumPy's datetime do: A. Use a linear time representation, point 1. Conversion to/from ISO 8601 strings thus requires leap-second knowledge. B. Extend ISO8601 with a "TAI" suffix which is just like "Z" but has no leap-seconds and is 34 seconds ahead of an equivalent "Z" time at the present. This satisfies point 3. C. Add a metadata item which chooses between "UTC" and "TAI". For seconds and finer, converting between UTC and TAI units is safe, and for minutes and coarser, is unsafe. UTC involves leap-seconds and an ambiguity-resolution in the few cases that need it, while in TAI everything is linear. Local time zone-based formatting is only available for UTC datetimes. By default, all units will be UTC, but either TAI or UTC can be specified like 'M8[D:TAI]' or 'M8[h:UTC]'. Does this design sound reasonable to people? Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdickinson at enthought.com Mon Jun 6 16:47:38 2011 From: mdickinson at enthought.com (Mark Dickinson) Date: Mon, 6 Jun 2011 21:47:38 +0100 Subject: [Numpy-discussion] dealing with datetime UTC vs linear time In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 9:24 PM, Mark Wiebe wrote: Sorry for being dense, but: > C. Add a metadata item which chooses between "UTC" and "TAI". For seconds > and finer, converting between UTC and TAI units is safe, and for minutes and > coarser, is unsafe. UTC involves leap-seconds This bit I understand, but: > and an ambiguity-resolution in the few cases that need it, Where does the ambiguity come from? (Assuming we're not more than 6 months into the future.) There would be ambiguity with POSIX time, but doesn't UTC manage to specify any point in time unambiguously? > [...] > Does this design sound reasonable to people? Python's datetime and timedelta types use POSIX time (i.e., just pretend that leap seconds don't exist and hope that everything works out okay). Would this cause problems when converting between Python and NumPy types? I guess part of the point of this work is that with full datetime support in NumPy the need for such conversions would be relatively rare, so maybe there's no real problem here. But having thought quite hard about some of these issues for Python's datetime module recently, I'm sceptical that leap second support is really that valuable, and fear that having it turned on by default will cause nasty surprises for those who have no need to think about leap seconds and write code that assumes that a minute has 60 seconds. Mark From mwwiebe at gmail.com Mon Jun 6 18:12:56 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 6 Jun 2011 17:12:56 -0500 Subject: [Numpy-discussion] dealing with datetime UTC vs linear time In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 3:47 PM, Mark Dickinson wrote: > On Mon, Jun 6, 2011 at 9:24 PM, Mark Wiebe wrote: > > Sorry for being dense, but: > > > C. Add a metadata item which chooses between "UTC" and "TAI". For seconds > > and finer, converting between UTC and TAI units is safe, and for minutes > and > > coarser, is unsafe. UTC involves leap-seconds > > This bit I understand, but: > > > and an ambiguity-resolution in the few cases that need it, > > Where does the ambiguity come from? (Assuming we're not more than 6 > months into the future.) There would be ambiguity with POSIX time, > but doesn't UTC manage to specify any point in time unambiguously? > If you add one minute to an added leap-second, you end up at second 60 in a minute which has only 60 seconds, and you have to fall back to second 59 or forward to second 0 of the next minute. Also, if you add a minute which should land where a leap-second was removed, you have to fall back to second 58 or second 0 of the next minute. > [...] > > Does this design sound reasonable to people? > > Python's datetime and timedelta types use POSIX time (i.e., just > pretend that leap seconds don't exist and hope that everything works > out okay). Would this cause problems when converting between Python > and NumPy types? Converting from the Python type to the NumPy type has a problem for removed leap-seconds (of which there are currently none), and converting from the NumPy type to the Python type has a problem for added leap-seconds. To allow for round-tripping to/from python objects, I think I would just make it return the underlying 64-bit number instead of a Python datetime object. 
This is also what it does for years outside of the range Python accepts. > I guess part of the point of this work is that with > full datetime support in NumPy the need for such conversions would be > relatively rare, so maybe there's no real problem here. > Yeah, as long as the calendar-related API is sufficient, I think that's true. But having thought quite hard about some of these issues for Python's > datetime module recently, I'm sceptical that leap second support is > really that valuable, and fear that having it turned on by default > will cause nasty surprises for those who have no need to think about > leap seconds and write code that assumes that a minute has 60 seconds. > I think that by having some default ambiguity-resolution rules for leap seconds, things will generally work out as people expect. On the other hand, with POSIX time, anybody that wants things precise with respect to leap-seconds is automatically excluded. The potential surprises from leap-seconds seem milder than the surprises from time zones and the multitude of possible calendars out there. I found it disconcerting that datetime.datetime.now() returns local time with no time zone indication on it. Here's how I've resolved that issue in NumPy: >>> datetime.datetime.now() datetime.datetime(2011, 6, 6, 17, 10, 32, 444658) >>> np.datetime64('now') numpy.datetime64('2011-06-06T17:10:41.000000-0500','us') Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mattknox.ca at gmail.com Mon Jun 6 20:10:07 2011 From: mattknox.ca at gmail.com (Matt Knox) Date: Tue, 7 Jun 2011 00:10:07 +0000 (UTC) Subject: [Numpy-discussion] dealing with datetime UTC vs linear time References: Message-ID: Mark Wiebe gmail.com> writes: > I guess part of the point of this work is that with > full datetime support in NumPy the need for such conversions would be > relatively rare, so maybe there's no real problem here. > > > Yeah, as long as the calendar-related API is sufficient, I think that's true.? > I've been watching this thread for a bit with some interest. Just thought I'd chime in quickly about this item. I use the scikits.timeseries module heavily and converting to/from python datetime objects easily is essential for me. I work extensively with relational databases (sql server, mysql, etc) and all (most?) of the python database adapters use python datetime objects when passing data back and forth from the database to python. So people wanting to analyze date / time series data from relational databases in numpy would be affected by this. - Matt Knox From mdickinson at enthought.com Tue Jun 7 03:29:52 2011 From: mdickinson at enthought.com (Mark Dickinson) Date: Tue, 7 Jun 2011 08:29:52 +0100 Subject: [Numpy-discussion] dealing with datetime UTC vs linear time In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 11:12 PM, Mark Wiebe wrote: > On Mon, Jun 6, 2011 at 3:47 PM, Mark Dickinson >> On Mon, Jun 6, 2011 at 9:24 PM, Mark Wiebe wrote: >> > and an ambiguity-resolution in the few cases that need it, >> >> Where does the ambiguity come from? ?(Assuming we're not more than 6 >> months into the future.) ?There would be ambiguity with POSIX time, >> but doesn't UTC manage to specify any point in time unambiguously? > > If you add one minute to an added leap-second [...] Ah, yes. I wasn't thinking as far ahead as arithmetic. Thanks. > [...] > I think that by having some default ambiguity-resolution rules for leap > seconds, things will generally work out as people expect. 
Yes, I expect that's true. > On the other hand, > with POSIX time, anybody that wants things precise with respect to > leap-seconds is automatically excluded. The potential surprises from > leap-seconds seem milder than the surprises from time zones and the > multitude of possible calendars out there. Well, except that people are quite used to having to deal with time zones / DST / etc. as a matter of course, and that the surprises from time zones are likely to bite earlier (i.e., with luck, during development rather than after deployment). I'd argue that the rarity of leap seconds (both in terms of ratio of leap seconds to non-leap seconds, and in terms of the number of applications that really have to care about leap seconds) would make for late-surfacing bugs in application code; so I'd personally prefer to see an 'opt-in' version of leap second support, rather than 'opt-out'. Still, I'm largely guessing here, so take with a large pinch of salt. (If this were a Python mailing list, I'd trot out 'Practicality Beats Purity' from the Zen here.) > I found it disconcerting that > datetime.datetime.now() returns local time with no time zone indication on > it. Yes, I agree that that's disturbing. datetime support for time zones isn't the best. Thanks for the detailed response. I'll go back to lurking now. Mark From dave.hirschfeld at gmail.com Tue Jun 7 08:34:16 2011 From: dave.hirschfeld at gmail.com (Dave Hirschfeld) Date: Tue, 7 Jun 2011 12:34:16 +0000 (UTC) Subject: [Numpy-discussion] fixing up datetime Message-ID: As a user of numpy/scipy in finance I thought I would put in my 2p worth as it's something which is of great importance in this area. I'm currently a heavy user of the scikits.timeseries package by Matt & Pierre and I'm also following the development of statsmodels and pandas should we require more sophisticated statistics in future. Hopefully the numpy datetime type will provide a foundation such packages can build upon... I'll use the timeseries package for reference since I'm most familiar with it and it's a very good api for my requirements. Apologies to Matt/Pierre if I get anything wrong - feel free to correct my misconceptions... I think some of the complexity is coming from the definition of the timedelta. In the timeseries package each date simply represents the number of periods since the epoch and the difference between dates is therefore just and integer with no attached metadata - its meaning is determined by the context it's used in. e.g. In [56]: M1 = ts.Date('M',"01-Jan-2011") In [57]: M2 = ts.Date('M',"01-Jan-2012") In [58]: M2 - M1 Out[58]: 12 timeseries gets on just fine without a timedelta type - a timedelta is just an integer and if you add an integer to a date it's interpreted as the number of periods of that dates frequency. From a useability point of view M1 + 1 is much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1). Something like the dateutil relativedelta pacage is very convenient and could serve as a template for such functionality: In [59]: from dateutil.relativedelta import relativedelta In [60]: (D1 + 30).datetime Out[60]: datetime.datetime(2011, 1, 31, 0, 0) In [61]: (D1 + 30).datetime + relativedelta(months=1) Out[61]: datetime.datetime(2011, 2, 28, 0, 0) ...but you can still get the same behaviour without a timedelta by asking that the user explicitly specify what they mean by "adding one month" to a date of a different frequency. e.g. 
In [62]: (D1 + 30) Out[62]: In [63]: _62.asfreq('M') + 1 Out[63]: In [64]: (_62.asfreq('M') + 1).asfreq('D','END') Out[64]: In [65]: (_62.asfreq('M') + 1).asfreq('D','START') + _62.day Out[65]: As Pierre noted when converting dates from a lower frequency to a higher one it's very useful (essential!) to be able to specify whether you want the end or the start of the interval. It may also be useful to be able to specify an arbitrary offset from either the start or the end of the interval so you could do something like: In [66]: (_62.asfreq('M') + 1).asfreq('D', offset=0) Out[66]: In [67]: (_62.asfreq('M') + 1).asfreq('D', offset=-1) Out[67]: In [68]: (_62.asfreq('M') + 1).asfreq('D', offset=15) Out[68]: I don't think it's useful to define higher 'frequencies' as arbitrary multiples of lower 'frequencies' unless the conversion is exact otherwise it leads to the following inconsistencies: In [69]: days_per_month = 30 In [70]: D1 = M1.asfreq('D',relation='START') In [71]: D2 = M2.asfreq('D','START') In [72]: D1, D2 Out[72]: (, ) In [73]: D1 + days_per_month*(M2 - M1) Out[73]: In [74]: D1 + days_per_month*(M2 - M1) == D2 Out[74]: False If I want the number of days between M1 and M2 I explicitely do the conversion myself: In [75]: M2.asfreq('D','START') - M1.asfreq('D','START') Out[75]: 365 thus avoiding any inconsistency: In [76]: D1 + (M2.asfreq('D','START') - M1.asfreq('D','START')) == D2 Out[76]: True I'm not convinced about the events concept - it seems to add complexity for something which could be accomplished better in other ways. A [Y]//4 dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There may well be a good reason for it however I can't see the need for it in my own applications. In the timeseries package, because the difference between dates represents the number of periods between the dates they must be of the same frequency to unambiguopusly define what a "period" means: In [77]: M1 - D1 --------------------------------------------------------------------------- ValueError Traceback (most recent call last) C:\dev\code\ in () ValueError: Cannot subtract Date objects with different frequencies. I would argue that in the spirit that we're all consenting adults adding dates of the same frequency can be a useful thing for example in finding the mid-point between two dates: In [78]: M1.asfreq('S','START') Out[78]: In [79]: M2.asfreq('S','START') Out[79]: In [80]: ts.Date('S', (_64.value + _65.value)//2) Out[80]: I think any errors which arose from adding or multiplying dates would be pretty easy to spot in your code. As Robert mentioned the economic data we use is often supplied as weekly, monthly, quarterly or annual data. So these frequencies are critical if we're to use the the array as a container for economic data. Such data would usually represent either the sum or the average over that period so it's very easy to get a consistent "linear-time" representation by interpolating down to a higher frequency such as daily or hourly. I really like the idea of being able to specify multiples of the base frequency - e.g. [7D] is equivalenty to [W] not the least because it provides an easy way to specify quarters [3M] or seasons [6M] which are important in my work. NB: I also deal with half-hourly and quarter-hourly timeseries and I'm sure there are many other example which are all made possible by allowing multipliers. One aspect of this is that the origin becomes important - i.e. does the week [7D] start on Monday/Tuesday etc. 
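A quick sketch of why the origin matters, using nothing more than naive
days-since-epoch bucketing (purely illustrative, not any existing API):

    import datetime

    epoch = datetime.date(1970, 1, 1)
    epoch.weekday()                          # 3, i.e. a Thursday

    def week_index(d, origin=epoch):
        # naive [7D] bucketing: whole weeks elapsed since the origin
        return (d - origin).days // 7

    # Wed 08-Jun-2011 and Thu 09-Jun-2011 land in different buckets,
    # because with a 1970-01-01 origin the buckets run Thursday-Wednesday
    week_index(datetime.date(2011, 6, 8)), week_index(datetime.date(2011, 6, 9))
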
In scikits.timeseries this is solved by defining a different weekly frequency for each day of the week, a different annual frequency starting at each month etc... http://pytseries.sourceforge.net/core.constants.html I'm thinking however that it may be possible to use the origin/zero-date/epoch attribute to define the start of such periods - e.g. if you had a weekly [7D] frequency and the origin was 01-Jan-1970 then each week would be defined as a Thursday-Thursday week. To get a Monday-Monday week you could supply 05-Jan-1970 as the origin attribute. Unfortunately business days and holidays are also very important in finance, however I agree that this may be better suited to a calendar API. I would suggest that leap seconds would be something which could also be handled by this API rather than having such complex functionality baked in by default. I'm not sure how this could be implemented in practice except for some vague thoughts about providing hooks where users could provide functions which converted to and from an integer representation for their particular calendar. Creating a weekday calendar would be a good test-case for such an API. Apologies for the very long post! I guess it can be mostly summarised as you've got a pretty good template for functionality in scikits.timeseries! Pandas/statsmodels may have more sophisticated requirements though so their input on the finance/econometric side would be useful... -Dave From mwwiebe at gmail.com Tue Jun 7 09:09:36 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 08:09:36 -0500 Subject: [Numpy-discussion] dealing with datetime UTC vs linear time In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 2:29 AM, Mark Dickinson wrote: > On Mon, Jun 6, 2011 at 11:12 PM, Mark Wiebe wrote: > > > > On the other hand, > > with POSIX time, anybody that wants things precise with respect to > > leap-seconds is automatically excluded. The potential surprises from > > leap-seconds seem milder than the surprises from time zones and the > > multitude of possible calendars out there. > > Well, except that people are quite used to having to deal with time > zones / DST / etc. as a matter of course, and that the surprises from > time zones are likely to bite earlier (i.e., with luck, during > development rather than after deployment). I'd argue that the rarity > of leap seconds (both in terms of ratio of leap seconds to non-leap > seconds, and in terms of the number of applications that really have > to care about leap seconds) would make for late-surfacing bugs in > application code; so I'd personally prefer to see an 'opt-in' version > of leap second support, rather than 'opt-out'. Still, I'm largely > guessing here, so take with a large pinch of salt. > This is what I'm leaning towards now, too. Instead of UTC vs TAI, I think it's POSIX vs TAI and just for the datetime type, not the timedelta. So I'm changing my proposal to M8[s:POSIX] and M8[s:TAI] as the dtypes, with POSIX being the default if nothing is specified. Thanks, Mark (If this were a Python mailing list, I'd trot out 'Practicality Beats > Purity' from the Zen here.) > > > I found it disconcerting that > > datetime.datetime.now() returns local time with no time zone indication > on > > it. > > Yes, I agree that that's disturbing. datetime support for time zones > isn't the best. > > Thanks for the detailed response. I'll go back to lurking now. 
> > Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwgoodman at gmail.com Tue Jun 7 11:32:24 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Tue, 7 Jun 2011 08:32:24 -0700 Subject: [Numpy-discussion] [job] Python Job at Hedge Fund Message-ID: We are looking for help to predict tomorrow's stock returns. The challenge is model selection in the presence of noisy data. The tools are ubuntu, python, cython, c, numpy, scipy, la, bottleneck, git. A quantitative background and experience or interest in model selection, machine learning, and software development are a plus. This is a full time position in Berkeley, California, two blocks from UC Berkeley. If you are interested send a CV or similar (or questions) to '.'.join(['htiek','scitylanayelekreb at namdoog','moc'][::-1])[::-1] From robert.kern at gmail.com Tue Jun 7 11:54:58 2011 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 7 Jun 2011 10:54:58 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld wrote: > I'm not convinced about the events concept - it seems to add complexity > for something which could be accomplished better in other ways. A [Y]//4 > dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There > may well be a good reason for it however I can't see the need for it in my > own applications. Well, [D/100] doesn't represent [864s]. It represents something that happens 100 times a day, but not necessarily at precise regular intervals. For example, suppose that I am representing payments that happen twice a month, say on the 1st and 15th of every month, or the 5th and 20th. I would use [M/2] to represent that. It's not [2W], and it's not [15D]. It's twice a month. The default conversions may seem to imply that [D/100] is equivalent to [864s], but they are not intended to. They are just a starting point for one to write one's own, more specific conversions. Similarly, we have default conversions from low frequencies to high frequencies defaulting to representing the higher precision event at the beginning of the low frequency interval. E.g. for days->seconds, we assume that the day is representing the initial second at midnight of that day. We then use offsets to allow the user to add more information to specify it more precisely. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From charlesr.harris at gmail.com Tue Jun 7 12:09:57 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 7 Jun 2011 10:09:57 -0600 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 9:54 AM, Robert Kern wrote: > On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld > wrote: > > > I'm not convinced about the events concept - it seems to add complexity > > for something which could be accomplished better in other ways. A [Y]//4 > > dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There > > may well be a good reason for it however I can't see the need for it in > my > > own applications. > > Well, [D/100] doesn't represent [864s]. 
It represents something that > happens 100 times a day, but not necessarily at precise regular > intervals. For example, suppose that I am representing payments that > happen twice a month, say on the 1st and 15th of every month, or the > 5th and 20th. I would use [M/2] to represent that. It's not [2W], and > it's not [15D]. It's twice a month. > > The default conversions may seem to imply that [D/100] is equivalent > to [864s], but they are not intended to. They are just a starting > point for one to write one's own, more specific conversions. > Similarly, we have default conversions from low frequencies to high > frequencies defaulting to representing the higher precision event at > the beginning of the low frequency interval. E.g. for days->seconds, > we assume that the day is representing the initial second at midnight > of that day. We then use offsets to allow the user to add more > information to specify it more precisely. > > Ah, it's a new concept: DeltaTime as a container of events ;) And what did you do during Summer vacation? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Tue Jun 7 12:53:04 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Tue, 07 Jun 2011 09:53:04 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <4DE7B8C0.4040805@noaa.gov> <4DED1D86.5070304@noaa.gov> Message-ID: <4DEE5770.1090308@noaa.gov> Pierre GM wrote: > Using the ISO as reference, you have a good definition of months. Yes, but only one. there are others. For instance, the climate modelers like to use a calendar that has 360 days a year: 12 30 day months. That way they get something with the same timescale as months and years, but have nice, linear, easy to use units (differentiable, and all that). Mark Wiebe wrote: > Code Interpreted as > Y 12M, 52W, 365D > M 4W, 30D, 720h > > This is even self inconsistent: > > 1Y == 365D > > 1Y == 12M == 12 * 30D == 360D > > 1Y == 12M == 12 * 4W == 12 * 4 * 7D == 336D > > 1Y == 52W == 52 * 7D == 364D > > Is it not clear from this what a mess of mis-interpretation might result > from all that? > > > This part of the code is used for mapping metadata like [Y/4] -> [3M], > or [Y/26] -> [2W]. I agree that this '/' operator in the unit metadata > is weird, and wouldn't object to removing it. Weird, dangerous, and unnecessary. I can see how some data may be on, for example quarters, but that should require a definition of quarters that's more defined. > This goes to heck is the data is expressed in something like "months > since 1995-01-01" > > Because months are only defined on a Calendar. > > > Here's what the current implementation can do with that one: > > >>> np.datetime64('1995-01-01', 'M') + 13 > numpy.datetime64('1996-02','M') I see -- I have a better idea of the intent here, and I can see that as long as you keep everything in the same unit (say, months, in this case), then this can be a clean and effective way to deal with this sort of data. As I said, the netcdf case is a different use case, but I think the issue there was that the creator of the data was thinking of it as being used like above: "months since January, 1995", and the data was all integer values for months, it makes perfect sense, and is well defined. 
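For instance, I'd expect that whole class of "months since ..." data to come across as something like this, if I'm reading the intended behavior right (the exact values shown are my guess, not output I've checked):

>>> import numpy as np
>>> months_since = np.array([0, 1, 13, 25])   # the integer offsets stored in the file
>>> np.datetime64('1995-01', 'M') + months_since

which should give 1995-01, 1995-02, 1996-02 and 1997-02, with no assumption about month lengths needed.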
The problem in that case is that the standard does not have a specification that enforces that the units stay months, and that the intervals are integers -- so software looked at that, converted it to, for example, python datetime instances (using some pre-defined definition for the length of a month), and got something that mis-represented the data.

The numpy use-case is different, but it's my concern that that same kind of thing could easily happen, because people want to write generic code that deals with arbitrary np.datetime64 instances.

I suppose we could consider this analogous to issues with integer and floating point dtypes -- when you convert between those, it's user-beware, but I think that would be more clear if we had a set of dtypes:

datetime_months
datetime_hours
datetime_seconds

But that list would get big in a hurry!

Also, with the Python datetime module, for instance, what I like about it is that I don't have to know or care how it's stored internally -- all I need to know is what range and precision it can deal with. numpy has performance issues that may not make that possible, but I still like it.

Maybe two types:

datetime_calendar: for Calendar-type units (months, business days, ...)
datetime_continuous: for "linear units" (seconds, hours, ...)

or something like that?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov

From mwwiebe at gmail.com  Tue Jun 7 13:09:05 2011
From: mwwiebe at gmail.com (Mark Wiebe)
Date: Tue, 7 Jun 2011 12:09:05 -0500
Subject: [Numpy-discussion] merging datetime progress
Message-ID: 

I'd like to merge the changes I've done to datetime so far into NumPy master today. It's a fairly big chunk of changes, with the following highlights:

* Abstract the ufunc type resolution into a replaceable function, for datetime's type rules
* Remove the TimeInteger scalar type, make timedelta64 subclass from signedint, and datetime64 subclass from the base scalar type
* Get the datetime functionality largely following the NEP.
* Partially remove the old API.

If anyone has time to review the changes, I would still greatly appreciate feedback after merging just as much as before. :)

Thanks,
Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dave.hirschfeld at gmail.com  Tue Jun 7 13:41:55 2011
From: dave.hirschfeld at gmail.com (Dave Hirschfeld)
Date: Tue, 7 Jun 2011 17:41:55 +0000 (UTC)
Subject: [Numpy-discussion] fixing up datetime
References: 
Message-ID: 

Robert Kern gmail.com> writes:
> 
> On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld gmail.com> wrote:
> 
> > I'm not convinced about the events concept - it seems to add complexity
> > for something which could be accomplished better in other ways. A [Y]//4
> > dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There
> > may well be a good reason for it however I can't see the need for it in my
> > own applications.
> 
> Well, [D/100] doesn't represent [864s]. It represents something that
> happens 100 times a day, but not necessarily at precise regular
> intervals. For example, suppose that I am representing payments that
> happen twice a month, say on the 1st and 15th of every month, or the
> 5th and 20th. I would use [M/2] to represent that. It's not [2W], and
> it's not [15D]. It's twice a month.
> > The default conversions may seem to imply that [D/100] is equivalent > to [864s], but they are not intended to. They are just a starting > point for one to write one's own, more specific conversions. > Similarly, we have default conversions from low frequencies to high > frequencies defaulting to representing the higher precision event at > the beginning of the low frequency interval. E.g. for days->seconds, > we assume that the day is representing the initial second at midnight > of that day. We then use offsets to allow the user to add more > information to specify it more precisely. > That would be one way of dealing with irregularly spaced data. I would argue that the example is somewhat back-to-front though. If something happens twice a month it's not occuring at a monthly frequency, but at a higher frequency. In this case the lowest frequency which can capture this data is daily frequency so it sould be stored at daily frequency and if monthly statistics are required the series can be aggregated up after the fact. e.g. In [2]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M') In [3]: dates = dates.asfreq('D','START') In [4]: dates = (dates[:,None] + np.tile(array([4, 19]), [12, 1])).ravel() In [5]: data = 100 + 10*randn(12, 2) In [6]: payments = ts.time_series(data.ravel(), dates) In [7]: payments Out[7]: timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.12588909 89.86413807 94.89200485 93.69989375 103.37375202 104.7628273 97.45956699 93.39594431 94.79258639 102.90656477 87.42346985 91.43556069 95.21947628 93.0671271 107.07400065 92.0835356 94.11035154 86.66521318 109.36556861 101.69789341], dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011], freq = D) In [8]: payments.convert('M', ma.count) Out[8]: timeseries([2 2 2 2 2 2 2 2 2 2 2 2], dates = [Jan-2011 ... Dec-2011], freq = M) In [9]: payments.convert('M', ma.sum) Out[9]: timeseries([205.061556202 193.009420163 191.990027161 188.591898598 208.136579315 190.855511303 197.699151161 178.859030538 188.286603379 199.157536259 180.775564724 211.063462017], dates = [Jan-2011 ... Dec-2011], freq = M) Alternatively for a fixed number of events per-period the values can just be stored in a 2D array - e.g. In [10]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M') In [11]: payments = ts.time_series(data, dates) In [12]: payments Out[12]: timeseries( [[ 103.76588849 101.29566771] [ 91.10363573 101.90578443] [ 102.12588909 89.86413807] [ 94.89200485 93.69989375] [ 103.37375202 104.7628273 ] [ 97.45956699 93.39594431] [ 94.79258639 102.90656477] [ 87.42346985 91.43556069] [ 95.21947628 93.0671271 ] [ 107.07400065 92.0835356 ] [ 94.11035154 86.66521318] [ 109.36556861 101.69789341]], dates = [Jan-2011 ... Dec-2011], freq = M) In [13]: payments.sum(1) Out[13]: timeseries([ 205.0615562 193.00942016 191.99002716 188.5918986 208.13657931 190.8555113 197.69915116 178.85903054 188.28660338 199.15753626 180.77556472 211.06346202], dates = [Jan-2011 ... Dec-2011], freq = M) It seems to me that either of these would satisfy the use-case with the added benefit of simplifying the datetime implementation. That said I'm not against the proposal if it provides someone with some benefit... 
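For what it's worth, here's roughly how I'd picture the first approach looking with bare datetime64 arrays rather than a timeseries object. It's only a sketch, and it assumes the default datetime64[D] -> datetime64[M] cast floors to the start of the month:

>>> import numpy as np
>>> dates = np.array(['2011-01-05', '2011-01-20', '2011-02-05', '2011-02-20'], dtype='M8[D]')
>>> amounts = np.array([103.77, 101.30, 91.10, 101.91])
>>> months = dates.astype('M8[M]')            # 2011-01, 2011-01, 2011-02, 2011-02
>>> [(m, amounts[months == m].sum()) for m in np.unique(months)]

which is the moral equivalent of payments.convert('M', ma.sum) above.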
Regarding the default conversions, the start of the interval is a perfectly acceptable default (and my preferred choice) however being able to specify the end is also useful as for the month-day conversion the end of each month can't be specified by a fixed offset from the start because of their varying lengths. Of course this can be found by subtracting 1 from the start of the next month: (M + 1).asfreq('D',offset=0) - 1 but just as it's easier to write List[-1] rather than List[List.length -1] it's easier to write M.asfreq('D',offset=-1) Regards, Dave From ralf.gommers at googlemail.com Tue Jun 7 14:41:34 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Tue, 7 Jun 2011 20:41:34 +0200 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Mon, Jun 6, 2011 at 6:56 PM, Mark Wiebe wrote: > On Mon, Jun 6, 2011 at 10:30 AM, Mark Wiebe wrote: > >> On Sun, Jun 5, 2011 at 3:43 PM, Ralf Gommers > > wrote: >> >>> On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: >>> >>>> On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < >>>> gael.varoquaux at normalesup.org> wrote: >>>> >>>>> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >>>>> > Would anyone object to, at least temporarily, tightening up the >>>>> default >>>>> > ufunc casting rule to 'same_kind' in NumPy master? It's a one line >>>>> change, >>>>> > so would be easy to undo, but such a change is very desirable in >>>>> my >>>>> > opinion. >>>>> > This would raise an exception, since it's np.add(a, 1.9, out=a), >>>>> > converting a float to an int: >>>>> >>>>> > >>> a = np.arange(3, dtype=np.int32) >>>>> >>>>> > >>> a += 1.9 >>>>> >>>>> That's probably going to break a huge amount of code which relies on >>>>> the >>>>> current behavior. >>>>> >>>>> Am I right in believing that this should only be considered for a major >>>>> release of numpy, say numpy 2.0? >>>> >>>> >>>> Absolutely, and that's why I'm proposing to do it in master now, fairly >>>> early in a development cycle, so we can evaluate its effects. If the next >>>> version is 1.7, we probably would roll it back for release (a 1 line >>>> change), and if the next version is 2.0, we probably would keep it in. >>>> >>>> I suspect at least some of the code relying on the current behavior may >>>> have bugs, and tightening this up is a way to reveal them. >>>> >>>> >>> Here are some results of testing your tighten_casting branch on a few >>> projects - no need to first put it in master first to do that. Four failures >>> in numpy, two in scipy, four in scikit-learn (plus two that don't look >>> related), none in scikits.statsmodels. I didn't check how many of them are >>> actual bugs. >>> >>> I'm not against trying out your change, but it would probably be good to >>> do some more testing first and fix the issues found before putting it in. >>> Then at least if people run into issues with the already tested packages, >>> you can just tell them to update those to latest master. >>> >> >> Cool, thanks for running those. I already took a chunk out of the NumPy >> failures. The ones_like function shouldn't really be a ufunc, but rather be >> like zeros_like and empty_like, but that's probably not something to change >> right now. The datetime-fixes type resolution change provides a mechanism to >> fix that up pretty easily. >> >> For Scipy, what do you think is the best way to resolve it? 
If NumPy 1.6 >> is the minimum version for the next scipy, I would add casting='unsafe' to >> the failing sqrt call. >> > > There's no reason to set the minimum required numpy to 1.6 AFAIK, and it's definitely not desirable. Ralf > I've updated the tighten_casting branch so it now passes all tests. For > masked arrays, this required changing some tests to not assume float -> int > casts are fine by default, but otherwise I fixed things by relaxing the > rules just where necessary. It now depends on the datetime-fixes branch, > which I would like to merge at its current point. > > -Mark > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jun 7 14:53:12 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 7 Jun 2011 12:53:12 -0600 Subject: [Numpy-discussion] Are the characters in numpy strings signed or unsigned? Message-ID: Because the C standard allows char to be either signed or unsigned and this affects sort order I believe we should pick one or the other for sorting. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 7 14:57:14 2011 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 7 Jun 2011 13:57:14 -0500 Subject: [Numpy-discussion] Are the characters in numpy strings signed or unsigned? In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 13:53, Charles R Harris wrote: > Because the C standard allows char to be either signed or unsigned and this > affects sort order I believe we should pick one or the other for sorting. > Thoughts? Unsigned for (at least notional) compatibility with the unicode dtype. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From mwwiebe at gmail.com Tue Jun 7 14:59:37 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 13:59:37 -0500 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Tue, Jun 7, 2011 at 1:41 PM, Ralf Gommers wrote: > On Mon, Jun 6, 2011 at 6:56 PM, Mark Wiebe wrote: > >> On Mon, Jun 6, 2011 at 10:30 AM, Mark Wiebe wrote: >> >>> On Sun, Jun 5, 2011 at 3:43 PM, Ralf Gommers < >>> ralf.gommers at googlemail.com> wrote: >>> >>>> On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: >>>> >>>>> On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < >>>>> gael.varoquaux at normalesup.org> wrote: >>>>> >>>>>> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >>>>>> > Would anyone object to, at least temporarily, tightening up the >>>>>> default >>>>>> > ufunc casting rule to 'same_kind' in NumPy master? It's a one >>>>>> line change, >>>>>> > so would be easy to undo, but such a change is very desirable in >>>>>> my >>>>>> > opinion. >>>>>> > This would raise an exception, since it's np.add(a, 1.9, out=a), >>>>>> > converting a float to an int: >>>>>> >>>>>> > >>> a = np.arange(3, dtype=np.int32) >>>>>> >>>>>> > >>> a += 1.9 >>>>>> >>>>>> That's probably going to break a huge amount of code which relies on >>>>>> the >>>>>> current behavior. >>>>>> >>>>>> Am I right in believing that this should only be considered for a >>>>>> major >>>>>> release of numpy, say numpy 2.0? 
>>>>> >>>>> >>>>> Absolutely, and that's why I'm proposing to do it in master now, fairly >>>>> early in a development cycle, so we can evaluate its effects. If the next >>>>> version is 1.7, we probably would roll it back for release (a 1 line >>>>> change), and if the next version is 2.0, we probably would keep it in. >>>>> >>>>> I suspect at least some of the code relying on the current behavior may >>>>> have bugs, and tightening this up is a way to reveal them. >>>>> >>>>> >>>> Here are some results of testing your tighten_casting branch on a few >>>> projects - no need to first put it in master first to do that. Four failures >>>> in numpy, two in scipy, four in scikit-learn (plus two that don't look >>>> related), none in scikits.statsmodels. I didn't check how many of them are >>>> actual bugs. >>>> >>>> I'm not against trying out your change, but it would probably be good to >>>> do some more testing first and fix the issues found before putting it in. >>>> Then at least if people run into issues with the already tested packages, >>>> you can just tell them to update those to latest master. >>>> >>> >>> Cool, thanks for running those. I already took a chunk out of the NumPy >>> failures. The ones_like function shouldn't really be a ufunc, but rather be >>> like zeros_like and empty_like, but that's probably not something to change >>> right now. The datetime-fixes type resolution change provides a mechanism to >>> fix that up pretty easily. >>> >>> For Scipy, what do you think is the best way to resolve it? If NumPy 1.6 >>> is the minimum version for the next scipy, I would add casting='unsafe' to >>> the failing sqrt call. >>> >> >> > There's no reason to set the minimum required numpy to 1.6 AFAIK, and it's > definitely not desirable. > Ok, I think there are two ways to resolve this kind of error. One would be to add a condition testing for a version >= 1.6, and setting casting='unsafe' when that occurs, and the other would be to insert a call to .astype() to force the type. The former is probably preferable, to avoid the unnecessary copy. Does numpy provide the version in tuple form, so a version comparison like this can be done easily? numpy.version doesn't seem to have one in it. -Mark > > Ralf > > >> I've updated the tighten_casting branch so it now passes all tests. For >> masked arrays, this required changing some tests to not assume float -> int >> casts are fine by default, but otherwise I fixed things by relaxing the >> rules just where necessary. It now depends on the datetime-fixes branch, >> which I would like to merge at its current point. >> >> -Mark >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at googlemail.com Tue Jun 7 15:08:31 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Tue, 7 Jun 2011 21:08:31 +0200 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Tue, Jun 7, 2011 at 8:59 PM, Mark Wiebe wrote: > On Tue, Jun 7, 2011 at 1:41 PM, Ralf Gommers wrote: > >> On Mon, Jun 6, 2011 at 6:56 PM, Mark Wiebe wrote: >> >>> On Mon, Jun 6, 2011 at 10:30 AM, Mark Wiebe wrote: >>> >>>> On Sun, Jun 5, 2011 at 3:43 PM, Ralf Gommers < >>>> ralf.gommers at googlemail.com> wrote: >>>> >>>>> On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: >>>>> >>>>>> On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < >>>>>> gael.varoquaux at normalesup.org> wrote: >>>>>> >>>>>>> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >>>>>>> > Would anyone object to, at least temporarily, tightening up the >>>>>>> default >>>>>>> > ufunc casting rule to 'same_kind' in NumPy master? It's a one >>>>>>> line change, >>>>>>> > so would be easy to undo, but such a change is very desirable in >>>>>>> my >>>>>>> > opinion. >>>>>>> > This would raise an exception, since it's np.add(a, 1.9, out=a), >>>>>>> > converting a float to an int: >>>>>>> >>>>>>> > >>> a = np.arange(3, dtype=np.int32) >>>>>>> >>>>>>> > >>> a += 1.9 >>>>>>> >>>>>>> That's probably going to break a huge amount of code which relies on >>>>>>> the >>>>>>> current behavior. >>>>>>> >>>>>>> Am I right in believing that this should only be considered for a >>>>>>> major >>>>>>> release of numpy, say numpy 2.0? >>>>>> >>>>>> >>>>>> Absolutely, and that's why I'm proposing to do it in master now, >>>>>> fairly early in a development cycle, so we can evaluate its effects. If the >>>>>> next version is 1.7, we probably would roll it back for release (a 1 line >>>>>> change), and if the next version is 2.0, we probably would keep it in. >>>>>> >>>>>> I suspect at least some of the code relying on the current behavior >>>>>> may have bugs, and tightening this up is a way to reveal them. >>>>>> >>>>>> >>>>> Here are some results of testing your tighten_casting branch on a few >>>>> projects - no need to first put it in master first to do that. Four failures >>>>> in numpy, two in scipy, four in scikit-learn (plus two that don't look >>>>> related), none in scikits.statsmodels. I didn't check how many of them are >>>>> actual bugs. >>>>> >>>>> I'm not against trying out your change, but it would probably be good >>>>> to do some more testing first and fix the issues found before putting it in. >>>>> Then at least if people run into issues with the already tested packages, >>>>> you can just tell them to update those to latest master. >>>>> >>>> >>>> Cool, thanks for running those. I already took a chunk out of the NumPy >>>> failures. The ones_like function shouldn't really be a ufunc, but rather be >>>> like zeros_like and empty_like, but that's probably not something to change >>>> right now. The datetime-fixes type resolution change provides a mechanism to >>>> fix that up pretty easily. >>>> >>>> For Scipy, what do you think is the best way to resolve it? If NumPy 1.6 >>>> is the minimum version for the next scipy, I would add casting='unsafe' to >>>> the failing sqrt call. >>>> >>> >>> >> There's no reason to set the minimum required numpy to 1.6 AFAIK, and it's >> definitely not desirable. >> > > Ok, I think there are two ways to resolve this kind of error. 
One would be > to add a condition testing for a version >= 1.6, and setting > casting='unsafe' when that occurs, and the other would be to insert a call > to .astype() to force the type. The former is probably preferable, to avoid > the unnecessary copy. > > Does numpy provide the version in tuple form, so a version comparison like > this can be done easily? numpy.version doesn't seem to have one in it. > > No, just as a string. I think it's fine to do: In [4]: np.version.short_version Out[4]: '2.0.0' In [5]: np.version.short_version > '1.5.1' Out[5]: True For a very convoluted version that produces a tuple see parse_numpy_version() in scipy.pavement.py Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Tue Jun 7 15:23:45 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Tue, 07 Jun 2011 12:23:45 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: <4DEE7AC1.30206@noaa.gov> Dave Hirschfeld wrote: > Robert Kern gmail.com> writes: >> Well, [D/100] doesn't represent [864s]. It represents something that >> happens 100 times a day, but not necessarily at precise regular >> intervals. >> For example, suppose that I am representing payments that >> happen twice a month, say on the 1st and 15th of every month, or the >> 5th and 20th. I would use [M/2] to represent that. It's not [2W], and >> it's not [15D]. It's twice a month. Got it -- very useful concept, of course. >> The default conversions may seem to imply that [D/100] is equivalent >> to [864s], but they are not intended to. They are just a starting >> point for one to write one's own, more specific conversions. Here is my issue -- I don't think there should be default conversions at all -- as you point out, "twice a month" is a well defined concept, but it is NOT the same as every 15 days (or any other set interval). I'm not sure it should be possible to "convert" to a regular interval at all. > That would be one way of dealing with irregularly spaced data. I would argue > that the example is somewhat back-to-front though. If something happens > twice a month it's not occuring at a monthly frequency, but at a higher > frequency. And that frequency is 2/month. > In this case the lowest frequency which can capture this data is > daily frequency so it should be stored at daily frequency I don't think it should, because it isn't 1/15 days, or, indeed, an frequency that can be specified as days. Sure you can specify the 5th and 20th of each month in a given time span in terms of days since an epoch, but you've lost some information there. When you want to do math -- like add a certain number of 1/2 months -- when is the 100th payment due? It seems keeping it in M/2 is the most natural way to deal with that -- then you don't need special code to do that addition, only when converting to a string (or other format) datetime. Anyway, my key point is that converting to/from calendar-based units and "linear time" units is fraught with peril -- it needs to be really clear when that is happening, and the user needs to have a clear and ideally easy way to define how it should happen. -Chris -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From kwgoodman at gmail.com Tue Jun 7 16:17:43 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Tue, 7 Jun 2011 13:17:43 -0700 Subject: [Numpy-discussion] Returning the same dtype in Cython as np.argmax Message-ID: What is the rule to determine the dtype returned by numpy functions that return indices such as np.argmax? I assumed that np.argmax() returned the same dtype as np.int_. That works on linux32/64 and win32 but on win-amd64 np.int_ is int32 and np.argmax() returns int64. Someone suggested using intp. But I can't figure out how to do that in Cython since I am unable to cimport NPY_INTP. So this doesn't work: from numpy cimport NPY_INTP cdef np.ndarray[np.intp_t, ndim=1] output = PyArray_EMPTY(1, dims, NPY_INTP, 0) Is there another way? All I can think of is checking whether np.intp is np.int32 or int64 when I template the code. From mwwiebe at gmail.com Tue Jun 7 16:55:19 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 15:55:19 -0500 Subject: [Numpy-discussion] Changing the datetime operation unit rules Message-ID: The NEP for datetime specifies a set of rules for how units should behave when combined with + or -. After playing around with the implementation a bit, I think the rules need to be changed. First, subtracting two datetime64's should produce the more precise unit. This allows things like the following, to recover the current day offset in the current year: >>> a = np.datetime64('today') >>> a - a.astype('M8[Y]') numpy.timedelta64(157,'D') vs >>> a = np.datetime64('today') >>> a - a.astype('M8[Y]') Traceback (most recent call last): File "", line 1, in TypeError: ufunc subtract cannot use operands with types dtype('datetime64[D]') and dtype('datetime64[Y]') Similarly, a datetime64 +/- timedelta64 should also produce the more precise unit, because things like the following intuitively should produce a date, not truncate back to the year: >>> np.datetime64('2011') + np.timedelta64(3,'M') + np.timedelta64(5,'D') numpy.datetime64('2011-04-06','D') vs >>> np.datetime64('2011') + np.timedelta64(3,'M') + np.timedelta64(5,'D') numpy.datetime64('2011','Y') Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Tue Jun 7 16:56:13 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Tue, 7 Jun 2011 22:56:13 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Jun 7, 2011, at 5:54 PM, Robert Kern wrote: > On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld wrote: > >> I'm not convinced about the events concept - it seems to add complexity >> for something which could be accomplished better in other ways. A [Y]//4 >> dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There >> may well be a good reason for it however I can't see the need for it in my >> own applications. > > Well, [D/100] doesn't represent [864s]. It represents something that > happens 100 times a day, but not necessarily at precise regular > intervals. For example, suppose that I am representing payments that > happen twice a month, say on the 1st and 15th of every month, or the > 5th and 20th. I would use [M/2] to represent that. It's not [2W], and > it's not [15D]. It's twice a month. I understand that, that was how the concept was developed in the first place. 
I still wonder it's that necessary. I would imagine that a structured type could do the same job, without to much hassle (I'm thinking of a ('D',100), like you can have a (np.float,100), for example...) > The default conversions may seem to imply that [D/100] is equivalent > to [864s], but they are not intended to. Well, like Chris suggested (I think), we could prevent type conversion when the denominator is not 1... > They are just a starting > point for one to write one's own, more specific conversions. > Similarly, we have default conversions from low frequencies to high > frequencies defaulting to representing the higher precision event at > the beginning of the low frequency interval. E.g. for days->seconds, > we assume that the day is representing the initial second at midnight > of that day. We then use offsets to allow the user to add more > information to specify it more precisely. As Dave H. summarized, we used a basic keyword to do the same thing in scikits.timeseries, with the addition of some subfrequencies like A-SEP to represent a year starting in September, for example. It works, but it's really not elegant a solution. On Jun 7, 2011, at 6:53 PM, Christopher Barker wrote: > Pierre GM wrote: >> Using the ISO as reference, you have a good definition of months. > > Yes, but only one. there are others. For instance, the climate modelers > like to use a calendar that has 360 days a year: 12 30 day months. That > way they get something with the same timescale as months and years, but > have nice, linear, easy to use units (differentiable, and all that). Ah Chris... That's easy for climate models, but when you have to deal with actual station data ;) More seriously, that's the kind of specific case that could be achieved by subclassing an array with a standard unit (like, days). I do understand your arguments for not recognizing a standard Gregorian calendar month as a universal unit. Nevertheless, it can be converted to days without ambiguity in the ISO8601 system. From dave.hirschfeld at gmail.com Tue Jun 7 17:13:17 2011 From: dave.hirschfeld at gmail.com (Dave Hirschfeld) Date: Tue, 7 Jun 2011 21:13:17 +0000 (UTC) Subject: [Numpy-discussion] fixing up datetime References: <4DEE7AC1.30206@noaa.gov> Message-ID: Christopher Barker noaa.gov> writes: > > Dave Hirschfeld wrote: > > That would be one way of dealing with irregularly spaced data. I would argue > > that the example is somewhat back-to-front though. If something happens > > twice a month it's not occuring at a monthly frequency, but at a higher > > frequency. > > And that frequency is 2/month. > > > In this case the lowest frequency which can capture this data is > > daily frequency so it should be stored at daily frequency > > I don't think it should, because it isn't 1/15 days, or, indeed, an > frequency that can be specified as days. Sure you can specify the 5th > and 20th of each month in a given time span in terms of days since an > epoch, but you've lost some information there. When you want to do math > -- like add a certain number of 1/2 months -- when is the 100th payment due? > It seems keeping it in M/2 is the most natural way to deal with that > -- then you don't need special code to do that addition, only when > converting to a string (or other format) datetime. With a monthly frequency you can't represent "2/month" except using a 2D array as in my second example. This does however discard the information about when in the month the payments occurred/are due. 
For that matter I'm not sure where the information that the events occur on the 5th and the 20th is recorded in the dtype [M]//2? Maybe my mental model is too fixed in the scikits.timeseries mode and I need to try out the new code... You're right that 2 per month isn't a daily frequency however as you noted the data can be stored in a daily frequency array with missing values for each date where a payment doesn't occur. The timeseries package uses a masked array to represent the missing data however you aren't required to have data for every day, only when you call the .fill_missing_dates() function is the array expanded to include a datapoint for every period. As shown below the timeseries is at a daily frequency but only datapoints for the 5th and 20th of the month are included: In [21]: payments Out[21]: timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.125889 89.86413807 94.89200485 93.69989375 103.37375202 104.7628273 97.45956699 93.39594431 94.79258639 102.90656477 87.42346985 91.43556069 95.21947628 93.0671271 107.07400065 92.0835356 94.11035154 86.66521318 109.36556861 101.69789341], dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011], freq = D) If I want to see when the 5th payment is due I can simply do: In [26]: payments[4:5] Out[26]: timeseries([ 102.12588909], dates = [05-Mar-2011], freq = D) Advancing the payments by a fixed number of days is possible: In [28]: payments.dates[:] += 3 In [29]: payments.dates Out[29]: DateArray([08-Jan-2011, 23-Jan-2011, 08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011], freq='D') Starting 3 payments in the future is more difficult and would require the date array to be recreated with the new starting date. One way of doing that would be: In [42]: dates = ts.date_array(payments.dates[2], length=(31*payments.size)//2) In [43]: dates = dates[(dates.day == 8) | (dates.day == 23)][0:payments.size] In [44]: dates Out[44]: DateArray([08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011, 08-Jan-2012, 23-Jan-2012], freq='D') In [45]: dates.shape Out[45]: (24,) A bit messy - I'll have to look at the numpy implementation to how it improves the situation... Regards, Dave From amcmorl at gmail.com Tue Jun 7 17:23:12 2011 From: amcmorl at gmail.com (Angus McMorland) Date: Tue, 7 Jun 2011 17:23:12 -0400 Subject: [Numpy-discussion] git source datetime build error Message-ID: Hi all, I'm experiencing a build failure from current numpy git, which seems to be datetime related. Here's the error message. 
creating build/temp.linux-i686-2.6/numpy/core/src/multiarray
compile options: '-Inumpy/core/include -Ibuild/src.linux-i686-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 -Ibuild/src.linux-i686-2.6/numpy/core/src/multiarray -Ibuild/src.linux-i686-2.6/numpy/core/src/umath -c'
gcc: numpy/core/src/multiarray/multiarraymodule_onefile.c
In file included from numpy/core/src/multiarray/scalartypes.c.src:24, from numpy/core/src/multiarray/multiarraymodule_onefile.c:10:
numpy/core/src/multiarray/_datetime.h:5: warning: function declaration isn't a prototype
In file included from numpy/core/src/multiarray/multiarraymodule_onefile.c:13:
numpy/core/src/multiarray/datetime.c:23: warning: function declaration isn't a prototype
numpy/core/src/multiarray/datetime.c:35: error: redefinition of '_datetime_strings'
numpy/core/src/multiarray/scalartypes.c.src:690: note: previous definition of '_datetime_strings' was here
In file included from numpy/core/src/multiarray/scalartypes.c.src:24, from numpy/core/src/multiarray/multiarraymodule_onefile.c:10:
numpy/core/src/multiarray/_datetime.h:5: warning: function declaration isn't a prototype
In file included from numpy/core/src/multiarray/multiarraymodule_onefile.c:13:
numpy/core/src/multiarray/datetime.c:23: warning: function declaration isn't a prototype
numpy/core/src/multiarray/datetime.c:35: error: redefinition of '_datetime_strings'
numpy/core/src/multiarray/scalartypes.c.src:690: note: previous definition of '_datetime_strings' was here
error: Command "gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Inumpy/core/include -Ibuild/src.linux-i686-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 -Ibuild/src.linux-i686-2.6/numpy/core/src/multiarray -Ibuild/src.linux-i686-2.6/numpy/core/src/umath -c numpy/core/src/multiarray/multiarraymodule_onefile.c -o build/temp.linux-i686-2.6/numpy/core/src/multiarray/multiarraymodule_onefile.o" failed with exit status 1

Thanks,

Angus
--
AJC McMorland
Post-doctoral research fellow
Neurobiology, University of Pittsburgh

From dave.hirschfeld at gmail.com  Tue Jun 7 17:47:24 2011
From: dave.hirschfeld at gmail.com (Dave Hirschfeld)
Date: Tue, 7 Jun 2011 21:47:24 +0000 (UTC)
Subject: [Numpy-discussion] Changing the datetime operation unit rules
References: 
Message-ID: 

Mark Wiebe gmail.com> writes:
> 
> >>> a = np.datetime64('today')
> 
> >>> a - a.astype('M8[Y]')
> 
> numpy.timedelta64(157,'D')
> 
> vs
> 
> >>> a = np.datetime64('today')
> >>> a - a.astype('M8[Y]')
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: ufunc subtract cannot use operands with types dtype('datetime64[D]') and dtype('datetime64[Y]')
> 
> Cheers,
> Mark
> 

I think I like the latter behaviour actually. The first one implicitly assumes I want the yearly frequency converted to the start of the period. I think the user should have to explicitly convert the dates to a common representation before performing arithmetic on them.

It seems to me that if you didn't want your data to be silently cast to the start of the period it would be very difficult to determine if this had happened or not. As an example consider the pricing of a stock using the Black-Scholes formula. The value is dependent on the time to maturity which is the expiry date minus the current date. If the expiry date was accidentally specified at (or converted to) a monthly frequency and the current date was a daily frequency, taking the difference will give you the wrong number of days until expiry, resulting in a subtly different answer - not good.

NB: The timeseries Date/DateArray have a day_of_year attribute which is very useful.

Regards,
Dave

From mwwiebe at gmail.com  Tue Jun 7 17:57:26 2011
From: mwwiebe at gmail.com (Mark Wiebe)
Date: Tue, 7 Jun 2011 16:57:26 -0500
Subject: [Numpy-discussion] git source datetime build error
In-Reply-To: 
References: 
Message-ID: 

Hi Angus,

Thanks for reporting that, I've committed a fix so it builds in the monolithic mode again.

-Mark

On Tue, Jun 7, 2011 at 4:23 PM, Angus McMorland wrote:

> Hi all,
>
> I'm experiencing a build failure from current numpy git, which seems
> to be datetime related. Here's the error message.
>
> creating build/temp.linux-i686-2.6/numpy/core/src/multiarray
> compile options: '-Inumpy/core/include -Ibuild/src.linux-i686-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 -Ibuild/src.linux-i686-2.6/numpy/core/src/multiarray -Ibuild/src.linux-i686-2.6/numpy/core/src/umath -c'
> gcc: numpy/core/src/multiarray/multiarraymodule_onefile.c
> In file included from numpy/core/src/multiarray/scalartypes.c.src:24,
> from numpy/core/src/multiarray/multiarraymodule_onefile.c:10:
> numpy/core/src/multiarray/_datetime.h:5: warning: function declaration isn't a prototype
> In file included from numpy/core/src/multiarray/multiarraymodule_onefile.c:13:
> numpy/core/src/multiarray/datetime.c:23: warning: function declaration isn't a prototype
> numpy/core/src/multiarray/datetime.c:35: error: redefinition of '_datetime_strings'
> numpy/core/src/multiarray/scalartypes.c.src:690: note: previous definition of '_datetime_strings' was here
> In file included from numpy/core/src/multiarray/scalartypes.c.src:24,
> from numpy/core/src/multiarray/multiarraymodule_onefile.c:10:
> numpy/core/src/multiarray/_datetime.h:5: warning: function declaration isn't a prototype
> In file included from numpy/core/src/multiarray/multiarraymodule_onefile.c:13:
> numpy/core/src/multiarray/datetime.c:23: warning: function declaration isn't a prototype
> numpy/core/src/multiarray/datetime.c:35: error: redefinition of '_datetime_strings'
> numpy/core/src/multiarray/scalartypes.c.src:690: note: previous definition of '_datetime_strings' was here
> error: Command "gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Inumpy/core/include -Ibuild/src.linux-i686-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 -Ibuild/src.linux-i686-2.6/numpy/core/src/multiarray -Ibuild/src.linux-i686-2.6/numpy/core/src/umath -c numpy/core/src/multiarray/multiarraymodule_onefile.c -o build/temp.linux-i686-2.6/numpy/core/src/multiarray/multiarraymodule_onefile.o" failed with exit status 1
>
> Thanks,
>
> Angus
> --
> AJC McMorland
> Post-doctoral research fellow
> Neurobiology, University of Pittsburgh
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mwwiebe at gmail.com  Tue Jun 7 19:16:25 2011
From: mwwiebe at gmail.com (Mark Wiebe)
Date: Tue, 7 Jun 2011 18:16:25 -0500
Subject: [Numpy-discussion] fixing up datetime
In-Reply-To: 
References: 
Message-ID: 

Hi Dave,

Thanks for all the feedback on the datetime, it's very useful to help understand the timeseries ideas, in particular with the many examples you're sprinkling in.

One overall impression I have about timeseries in general is the use of the term "frequency" synonymously with the time unit. To me, a frequency is a numerical quantity with a unit of 1/(time unit), so while it's related to the time unit, naming it the same is something the specific timeseries domain has chosen to do. I think the numpy datetime class shouldn't have anything called "frequency" in it, and I would like to remove the current usage of that terminology from the codebase.

In Wes's comment, he said

> I'm hopeful that the datetime64 dtype will enable scikits.timeseries
> and pandas to consolidate much of their datetime / frequency code.
> scikits.timeseries has a ton of great stuff for generating dates with
> all the standard fixed frequencies.

implying to me that the important functionality needed in time series is the ability to generate arrays of dates in specific ways. I suspect equating the specification of the array of dates and the unit of precision used to store the date isn't good for either the datetime functionality or supporting timeseries, and I'm presently trying to understand what it is that timeseries uses.

On Tue, Jun 7, 2011 at 7:34 AM, Dave Hirschfeld wrote:

> As a user of numpy/scipy in finance I thought I would put in my 2p worth as
> it's something which is of great importance in this area.
>
> I'm currently a heavy user of the scikits.timeseries package by Matt &
> Pierre and I'm also following the development of statsmodels and pandas
> should we require more sophisticated statistics in future. Hopefully the
> numpy datetime type will provide a foundation such packages can build upon...
>
> I'll use the timeseries package for reference since I'm most familiar with
> it and it's a very good api for my requirements. Apologies to Matt/Pierre
> if I get anything wrong - feel free to correct my misconceptions...
>
> I think some of the complexity is coming from the definition of the
> timedelta.
> In the timeseries package each date simply represents the number of periods > since the epoch and the difference between dates is therefore just and > integer > with no attached metadata - its meaning is determined by the context it's > used > in. e.g. > > In [56]: M1 = ts.Date('M',"01-Jan-2011") > > In [57]: M2 = ts.Date('M',"01-Jan-2012") > > In [58]: M2 - M1 > Out[58]: 12 > > timeseries gets on just fine without a timedelta type - a timedelta is just > an > integer and if you add an integer to a date it's interpreted as the number > of > periods of that dates frequency. From a useability point of view M1 + 1 is > much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1). > I think the timedelta is important, especially with the large number of units NumPy's datetime supports. When you're subtracting two nanosecond datetimes and two minute datetimes in the same code, having the units there to avoid confusion is pretty useful. Ideally, timedelta would just be a regular integer or float with a time unit associated, but NumPy doesn't have a physical units system integrated at present. > Something like the dateutil relativedelta pacage is very convenient and > could serve as a template for such functionality: > > In [59]: from dateutil.relativedelta import relativedelta > > In [60]: (D1 + 30).datetime > Out[60]: datetime.datetime(2011, 1, 31, 0, 0) > > In [61]: (D1 + 30).datetime + relativedelta(months=1) > Out[61]: datetime.datetime(2011, 2, 28, 0, 0) > > ...but you can still get the same behaviour without a timedelta by asking > that > the user explicitly specify what they mean by "adding one month" to a date > of > a different frequency. e.g. > > In [62]: (D1 + 30) > Out[62]: > > In [63]: _62.asfreq('M') + 1 > Out[63]: > > In [64]: (_62.asfreq('M') + 1).asfreq('D','END') > Out[64]: > > In [65]: (_62.asfreq('M') + 1).asfreq('D','START') + _62.day > Out[65]: > I don't envision 'asfreq' being a datetime function, this is the kind of thing that would layer on top in a specialized timeseries library. The behavior of timedelta follows a more physics-like idea with regard to the time unit, and I don't think something more complicated belongs at the bottom layer that is shared among all datetime uses. Here's a rough approximation of your calculations above: >>> d = np.datetime64('2011-01-31') >>> d.astype('M8[M]') + 1 numpy.datetime64('2011-02','M') >>> (d.astype('M8[M]') + 2) - np.timedelta64(1, 'D') numpy.datetime64('2011-02-28','D') >>> (d.astype('M8[M]') + 1) + np.timedelta64(31, 'D') numpy.datetime64('2011-03-04','D') As Pierre noted when converting dates from a lower frequency to a higher one > it's very useful (essential!) to be able to specify whether you want the > end > or the start of the interval. 
It may also be useful to be able to specify > an > arbitrary offset from either the start or the end of the interval so you > could > do something like: > > In [66]: (_62.asfreq('M') + 1).asfreq('D', offset=0) > Out[66]: > > In [67]: (_62.asfreq('M') + 1).asfreq('D', offset=-1) > Out[67]: > > In [68]: (_62.asfreq('M') + 1).asfreq('D', offset=15) > Out[68]: > I think this kind of functionality belongs at a higher level, but the idea is to make it reasonable to implement it with the NumPy datetime primitives: >>> (d.astype('M8[M]') + 1).astype('M8[D]') numpy.datetime64('2011-02-01','D') >>> ((d.astype('M8[M]') + 1) + 1) - np.timedelta64(1, 'D') numpy.datetime64('2011-02-28','D') >>> (d.astype('M8[M]') + 1) + np.timedelta64(15, 'D') numpy.datetime64('2011-02-16','D') I don't think it's useful to define higher 'frequencies' as arbitrary > multiples > of lower 'frequencies' unless the conversion is exact otherwise it leads > to the following inconsistencies: > > In [69]: days_per_month = 30 > > In [70]: D1 = M1.asfreq('D',relation='START') > > In [71]: D2 = M2.asfreq('D','START') > > In [72]: D1, D2 > Out[72]: (, ) > > In [73]: D1 + days_per_month*(M2 - M1) > Out[73]: > > In [74]: D1 + days_per_month*(M2 - M1) == D2 > Out[74]: False > Here's what I get: >>> d1, d2 = np.datetime64('2011-01-01'), np.datetime64('2012-01-01') >>> m1, m2 = d1.astype('M8[M]'), d2.astype('M8[M]') >>> d1 + 30 * (m2 - m1) Traceback (most recent call last): File "", line 1, in TypeError: Cannot get a common metadata divisor for types dtype('datetime64[D]') and dtype('timedelta64[M]') because they have incompatible nonlinear base time units >>> d1 + 30 * (m2 - m1).astype('i8') numpy.datetime64('2011-12-27','D') > If I want the number of days between M1 and M2 I explicitely do the > conversion > myself: > > In [75]: M2.asfreq('D','START') - M1.asfreq('D','START') > Out[75]: 365 > > thus avoiding any inconsistency: > > In [76]: D1 + (M2.asfreq('D','START') - M1.asfreq('D','START')) == D2 > Out[76]: True > > I'm not convinced about the events concept - it seems to add complexity > for something which could be accomplished better in other ways. A [Y]//4 > dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There > may well be a good reason for it however I can't see the need for it in my > own applications. > > In the timeseries package, because the difference between dates represents > the > number of periods between the dates they must be of the same frequency to > unambiguopusly define what a "period" means: > > In [77]: M1 - D1 > --------------------------------------------------------------------------- > ValueError Traceback (most recent call last) > > C:\dev\code\ in () > > ValueError: Cannot subtract Date objects with different frequencies. 
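Continuing from the same snippet, doing the conversion explicitly on the datetimes (rather than scaling the month delta) stays consistent here as well, which I think matches what Dave describes below. I'd expect something like:

>>> m2.astype('M8[D]') - m1.astype('M8[D]')
numpy.timedelta64(365,'D')
>>> d1 + (m2.astype('M8[D]') - m1.astype('M8[D]')) == d2
True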
> > I would argue that in the spirit that we're all consenting adults > adding dates of the same frequency can be a useful thing for example > in finding the mid-point between two dates: > > In [78]: M1.asfreq('S','START') > Out[78]: > > In [79]: M2.asfreq('S','START') > Out[79]: > > In [80]: ts.Date('S', (_64.value + _65.value)//2) > Out[80]: > Adding dates definitely doesn't work, because datetimes have no zero, but I would express it like this: >>> s1, s2 = m1.astype('M8[s]'), m2.astype('M8[s]') >>> s1 + (s2 - s1)/2 numpy.datetime64('2011-07-02T07:00:00-0500','s') >>> np.datetime_as_string(s1 + (s2 - s1)/2) '2011-07-02T12:00:00Z' Printing times in the local timezone by default makes that first printout a bit weird, but I really like having that default so this looks good: >>> np.datetime64('now') numpy.datetime64('2011-06-07T18:15:46-0500','s') I think any errors which arose from adding or multiplying dates would be > pretty > easy to spot in your code. > > As Robert mentioned the economic data we use is often supplied as weekly, > monthly, quarterly or annual data. So these frequencies are critical if > we're > to use the the array as a container for economic data. Such data would > usually > represent either the sum or the average over that period so it's very easy > to > get a consistent "linear-time" representation by interpolating down to a > higher > frequency such as daily or hourly. > > I really like the idea of being able to specify multiples of the base > frequency > - e.g. [7D] is equivalenty to [W] not the least because it provides an easy > way to specify quarters [3M] or seasons [6M] which are important in my > work. > NB: I also deal with half-hourly and quarter-hourly timeseries and I'm sure > there are many other example which are all made possible by allowing > multipliers. > > One aspect of this is that the origin becomes important - i.e. does the > week > [7D] start on Monday/Tuesday etc. In scikits.timeseries this is solved by > defining a different weekly frequency for each day of the week, a different > annual frequency starting at each month etc... > > http://pytseries.sourceforge.net/core.constants.html I'm thinking however that it may be possible to use the > origin/zero-date/epoch > attribute to define the start of such periods - e.g. if you had a weekly > [7D] > frequency and the origin was 01-Jan-1970 then each week would be defined as > a > Thursday-Thursday week. To get a Monday-Monday week you could supply > 05-Jan-1970 as the origin attribute. > This is one of the things where I think mixing the datetime storage precision with timeseries frequency seems counterproductive. Having different origins for datetime64 starting on different weekdays near 1970-01-01 doesn't seem like the right way to tackle the problem to me. I see other valid reasons for reintroducing the origin metadata, but this one I don't really like. > > Unfortunately business days and holidays are also very important in > finance, > however I agree that this may be better suited to a calendar API. I would > suggest that leap seconds would be something which could also be handled by > this API rather than having such complex functionality baked in by default. > I've got a business day API in development, and will post it for feedback soon. I'm not sure how this could be implemented in practice except for some vague > thoughts about providing hooks where users could provide functions which > converted to and from an integer representation for their particular > calendar. 
Creating a weekday calendar would be a good test-case for such > an API. > > Apologies for the very long post! I guess it can be mostly summarised as > you've got a pretty good template for functionality in scikits.timeseries! > Pandas/statsmodels may have more sophisticated requirements though so their > input on the finance/econometric side would be useful... > Thanks again for the feedback! -Mark > > -Dave > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 7 19:24:38 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 18:24:38 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 3:56 PM, Pierre GM wrote: > > On Jun 7, 2011, at 5:54 PM, Robert Kern wrote: > > > On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld > wrote: > > > >> I'm not convinced about the events concept - it seems to add complexity > >> for something which could be accomplished better in other ways. A [Y]//4 > >> dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There > >> may well be a good reason for it however I can't see the need for it in > my > >> own applications. > > > > Well, [D/100] doesn't represent [864s]. It represents something that > > happens 100 times a day, but not necessarily at precise regular > > intervals. For example, suppose that I am representing payments that > > happen twice a month, say on the 1st and 15th of every month, or the > > 5th and 20th. I would use [M/2] to represent that. It's not [2W], and > > it's not [15D]. It's twice a month. > > I understand that, that was how the concept was developed in the first > place. I still wonder it's that necessary. I would imagine that a structured > type could do the same job, without to much hassle (I'm thinking of a > ('D',100), like you can have a (np.float,100), for example...) > It appears to me that a structured dtype with some further NumPy extensions could entirely replace the 'events' metadata fairly cleanly. If the ufuncs are extended to operate on structured arrays, and integers modulo n are added as a new dtype, a dtype like [('date', 'M8[D]'), ('event', 'i8[mod 100]')] could replace the current 'M8[D]//100'. > The default conversions may seem to imply that [D/100] is equivalent > > to [864s], but they are not intended to. > > Well, like Chris suggested (I think), we could prevent type conversion when > the denominator is not 1... > I'd like to remove the '/100' functionality here, I don't think it gives any benefit. Saying 'M8[3M]' is better than 'M8[Y/4]', in my opinion, and maybe adding 'M8[Q]' and removing the number in front of the unit would be good too... > > They are just a starting > > point for one to write one's own, more specific conversions. > > Similarly, we have default conversions from low frequencies to high > > frequencies defaulting to representing the higher precision event at > > the beginning of the low frequency interval. E.g. for days->seconds, > > we assume that the day is representing the initial second at midnight > > of that day. We then use offsets to allow the user to add more > > information to specify it more precisely. > > As Dave H. 
summarized, we used a basic keyword to do the same thing in > scikits.timeseries, with the addition of some subfrequencies like A-SEP to > represent a year starting in September, for example. It works, but it's > really not elegant a solution. > This kind of thing definitely belongs in a layer above datetime. -Mark > On Jun 7, 2011, at 6:53 PM, Christopher Barker wrote: > > > Pierre GM wrote: > >> Using the ISO as reference, you have a good definition of months. > > > > Yes, but only one. there are others. For instance, the climate modelers > > like to use a calendar that has 360 days a year: 12 30 day months. That > > way they get something with the same timescale as months and years, but > > have nice, linear, easy to use units (differentiable, and all that). > > > Ah Chris... That's easy for climate models, but when you have to deal with > actual station data ;) More seriously, that's the kind of specific case that > could be achieved by subclassing an array with a standard unit (like, days). > I do understand your arguments for not recognizing a standard Gregorian > calendar month as a universal unit. Nevertheless, it can be converted to > days without ambiguity in the ISO8601 system. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fperez.net at gmail.com Tue Jun 7 19:27:31 2011 From: fperez.net at gmail.com (Fernando Perez) Date: Tue, 7 Jun 2011 16:27:31 -0700 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: Message-ID: Hi Mark, On Tue, Jun 7, 2011 at 10:09 AM, Mark Wiebe wrote: > I'd like to merge the changes I've done to datetime so far into NumPy master > today. It's a fairly big chunk of changes, with the following highlights: Is it this pull request: https://github.com/numpy/numpy/pull/85 or is there more work elsewhere? f From mwwiebe at gmail.com Tue Jun 7 19:33:46 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 18:33:46 -0500 Subject: [Numpy-discussion] Changing the datetime operation unit rules In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 4:47 PM, Dave Hirschfeld wrote: > Mark Wiebe gmail.com> writes: > > > > > >>> a = np.datetime64('today') > > > > >>> a - a.astype('M8[Y]') > > > > numpy.timedelta64(157,'D') > > > > vs > > > > > > >>> a = np.datetime64('today') > > >>> a - a.astype('M8[Y]') > > Traceback (most recent call last): > > File "", line 1, in > > TypeError: ufunc subtract cannot use operands with types > dtype('datetime64[D]') and dtype('datetime64[Y]') > > > > > > Cheers, > > Mark > > > > > > I think I like the latter behaviour actually. The first one implicitly > assumes > I want the yearly frequency converted to the start of the period. I think > the > user should have to explicitly convert the dates to a common representation > before performing arithmetic on them. > I'm thinking of a datetime as an infinitesimal moment in time, with the unit representing the precision that can be stored. Thus, '2011', '2011-01', and '2011-01-01T00:00:00.00Z' all represent the same moment in time, but with a different unit of precision. Computationally, this perspective is turning out to provide a pretty rich mechanism to do operations on date. A time period is a different beast, and conceptually consists of a (datetime, timedelta) tuple or two sequential datetimes together. 
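One possible sketch of that period idea, assuming structured dtypes can hold datetime64/timedelta64 fields (the field names are only illustrative):

>>> import numpy as np
>>> period = np.dtype([('start', 'M8[D]'), ('length', 'm8[D]')])
>>> edges = np.array(['2011-01-01', '2011-04-01', '2011-07-01',
...                   '2011-10-01', '2012-01-01'], dtype='M8[D]')
>>> quarters = np.empty(4, dtype=period)
>>> quarters['start'] = edges[:-1]                 # start instant of each quarter
>>> quarters['length'] = edges[1:] - edges[:-1]    # timedelta64[D] length of each quarter

A timeseries layer could then attach whatever START/END or calendar semantics it wants to such (start, length) pairs, without the datetime dtype itself having to know about periods.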
It seems to me that if you didn't want your data to be silently cast to the > start of the period it would be very difficult to determine if this had > happened or not. > > As an example consider the pricing of a stock using the black-scholes > formula. > The value is dependant on the time to maturity which is the expiry date > minus > the current date. If the expiry date was accidentally specified at (or > converted > to) a monthly frequency and the current date was a daily frequency taking > the > difference will give you the wrong number of days until expiry resulting in > a > subtly different answer - not good. > A timeseries library layer on top of datetime should be able to do appropriate type checking to ensure things are done safely. If that's not the case with the current design, I think the functionality that would enable that should be explored. NB: The timeseries Date/DateArray have a day_of_year attribute which is very > useful. > What I like about the above computation is that day_of_year is trivial to implement as a one-liner, not needing a specialized function or attribute at the bottom library layer. Cheers, Mark > > Regards, > Dave > > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 7 19:35:56 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 18:35:56 -0500 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 6:27 PM, Fernando Perez wrote: > Hi Mark, > > On Tue, Jun 7, 2011 at 10:09 AM, Mark Wiebe wrote: > > I'd like to merge the changes I've done to datetime so far into NumPy > master > > today. It's a fairly big chunk of changes, with the following highlights: > > Is it this pull request: https://github.com/numpy/numpy/pull/85 or is > there more work elsewhere? > I went ahead and did the merge today as I said I wanted to, that pull request is some further development for someone to code-review if they have time. -Mark > > f > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Tue Jun 7 19:53:11 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Wed, 8 Jun 2011 01:53:11 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> On Jun 8, 2011, at 1:16 AM, Mark Wiebe wrote: > Hi Dave, > > Thanks for all the feedback on the datetime, it's very useful to help understand the timeseries ideas, in particular with the many examples you're sprinkling in. > > One overall impression I have about timeseries in general is the use of the term "frequency" synonymously with the time unit. To me, a frequency is a numerical quantity with a unit of 1/(time unit), so while it's related to the time unit, naming it the same is something the specific timeseries domain has chosen to do, I think the numpy datetime class shouldn't have anything called "frequency" in it, and I would like to remove the current usage of that terminology from the codebase. True. We rather abused the term in scikits.timeseries, but we meant it as "given time unit". 
Matt came with the idea of representing a series of consecutive dates as an array of consecutive integers. The conversion integer<>datetime is done internally with an epoch and a unit. Initially, we called this latter frequency, but in the experimental git version I switched to unit. Anyhow, each time yo read 'frequency' in scikits.timeseries, think 'unit'. > In Wes's comment, he said > > I'm hopeful that the datetime64 dtype will enable scikits.timeseries > and pandas to consolidate much ofir the datetime / frequency code. > scikits.timeseries has a ton of great stuff for generating dates with > all the standard fixed frequencies. > > implying to me that the important functionality needed in time series is the ability to generate arrays of dates in specific ways. I suspect equating the specification of the array of dates and the unit of precision used to store the date isn't good for either the datetime functionality or supporting timeseries, and I'm presently trying to understand what it is that timeseries uses. You want a series of 365 consecutive days from today ? 'now' + np.arange(365). This kind of stuff. > On Tue, Jun 7, 2011 at 7:34 AM, Dave Hirschfeld wrote: > > I think some of the complexity is coming from the definition of the timedelta. > In the timeseries package each date simply represents the number of periods > since the epoch and the difference between dates is therefore just and integer > with no attached metadata - its meaning is determined by the context it's used > in. e.g. Exactly that. > timeseries gets on just fine without a timedelta type - a timedelta is just an > integer and if you add an integer to a date it's interpreted as the number of > periods of that dates frequency. From a useability point of view M1 + 1 is > much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1). Likewise, the difference between two dates is just an integer. [Mark] > I think the timedelta is important, especially with the large number of units NumPy's datetime supports. When you're subtracting two nanosecond datetimes and two minute datetimes in the same code, having the units there to avoid confusion is pretty useful. Indeed. > I don't envision 'asfreq' being a datetime function, this is the kind of thing that would layer on top in a specialized timeseries library. The behavior of timedelta follows a more physics-like idea with regard to the time unit, and I don't think something more complicated belongs at the bottom layer that is shared among all datetime uses. 'asfreq' converts from one unit to another (there's another function, convert, that does not quite exactly the same thing, but I won't get into details here). You'll probably have to take unit conversion into account if you allow the .view() or .astype() methods on your np.datetime array... > In [80]: ts.Date('S', (_64.value + _65.value)//2) > Out[80]: > > Adding dates definitely doesn't work, because datetimes have no zero, but I would express it like this: Well, it can be argued that the epoch is 0... But in scikits.timeseries, keep in mind that underneath, a DateArray is just an array of integer. [Dave] > I really like the idea of being able to specify multiples of the base frequency > - e.g. [7D] is equivalenty to [W] not the least because it provides an easy > way to specify quarters [3M] or seasons [6M] which are important in my work. > NB: I also deal with half-hourly and quarter-hourly timeseries and I'm sure > there are many other example which are all made possible by allowing > multipliers. 
Well, the experimental version kinda allowed that... > > This is one of the things where I think mixing the datetime storage precision with timeseries frequency seems counterproductive. Having different origins for datetime64 starting on different weekdays near 1970-01-01 doesn't seem like the right way to tackle the problem to me. I see other valid reasons for reintroducing the origin metadata, but this one I don't really like. We needed the concept to convert time series, for example from monthly to quarterly (what is the first month of the year (as in succession of 12 months) you want to start with ?) > From mwwiebe at gmail.com Tue Jun 7 20:10:03 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 7 Jun 2011 19:10:03 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: On Tue, Jun 7, 2011 at 6:53 PM, Pierre GM wrote: > > On Jun 8, 2011, at 1:16 AM, Mark Wiebe wrote: > > > Hi Dave, > > > > Thanks for all the feedback on the datetime, it's very useful to help > understand the timeseries ideas, in particular with the many examples you're > sprinkling in. > > > > One overall impression I have about timeseries in general is the use of > the term "frequency" synonymously with the time unit. To me, a frequency is > a numerical quantity with a unit of 1/(time unit), so while it's related to > the time unit, naming it the same is something the specific timeseries > domain has chosen to do, I think the numpy datetime class shouldn't have > anything called "frequency" in it, and I would like to remove the current > usage of that terminology from the codebase. > > True. We rather abused the term in scikits.timeseries, but we meant it as > "given time unit". > Matt came with the idea of representing a series of consecutive dates as an > array of consecutive integers. The conversion integer<>datetime is done > internally with an epoch and a unit. Initially, we called this latter > frequency, but in the experimental git version I switched to unit. Anyhow, > each time yo read 'frequency' in scikits.timeseries, think 'unit'. Sounds good. > In Wes's comment, he said > > > > I'm hopeful that the datetime64 dtype will enable scikits.timeseries > > and pandas to consolidate much ofir the datetime / frequency code. > > scikits.timeseries has a ton of great stuff for generating dates with > > all the standard fixed frequencies. > > > > implying to me that the important functionality needed in time series is > the ability to generate arrays of dates in specific ways. I suspect equating > the specification of the array of dates and the unit of precision used to > store the date isn't good for either the datetime functionality or > supporting timeseries, and I'm presently trying to understand what it is > that timeseries uses. > > You want a series of 365 consecutive days from today ? 'now' + > np.arange(365). This kind of stuff. 
> This one works: >>> np.datetime64('today') + np.arange(365) array(['2011-06-07', '2011-06-08', '2011-06-09', '2011-06-10', '2011-06-11', '2011-06-12', '2011-06-13', '2011-06-14', '2011-06-15', '2011-06-16', '2011-06-17', '2011-06-18', '2011-06-19', '2011-06-20', '2011-06-21', '2011-06-22', '2011-06-23', '2011-06-24', '2011-06-25', '2011-06-26', '2011-06-27', '2011-06-28', '2011-06-29', '2011-06-30', '2011-07-01', '2011-07-02', '2011-07-03', '2011-07-04', '2011-07-05', '2011-07-06', '2011-07-07', '2011-07-08', '2011-07-09', '2011-07-10', '2011-07-11', '2011-07-12', '2011-07-13', '2011-07-14', '2011-07-15', '2011-07-16', '2011-07-17', '2011-07-18', '2011-07-19', '2011-07-20', '2012-05-28', '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01', '2012-06-02', '2012-06-03', '2012-06-04', '2012-06-05'], dtype='datetime64[D]') >>> > > On Tue, Jun 7, 2011 at 7:34 AM, Dave Hirschfeld < > dave.hirschfeld at gmail.com> wrote: > > > > I think some of the complexity is coming from the definition of the > timedelta. > > In the timeseries package each date simply represents the number of > periods > > since the epoch and the difference between dates is therefore just and > integer > > with no attached metadata - its meaning is determined by the context it's > used > > in. e.g. > > Exactly that. > > > timeseries gets on just fine without a timedelta type - a timedelta is > just an > > integer and if you add an integer to a date it's interpreted as the > number of > > periods of that dates frequency. From a useability point of view M1 + 1 > is > > much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, > 1). > > Likewise, the difference between two dates is just an integer. > > [Mark] > > I think the timedelta is important, especially with the large number of > units NumPy's datetime supports. When you're subtracting two nanosecond > datetimes and two minute datetimes in the same code, having the units there > to avoid confusion is pretty useful. > > Indeed. > > > I don't envision 'asfreq' being a datetime function, this is the kind of > thing that would layer on top in a specialized timeseries library. The > behavior of timedelta follows a more physics-like idea with regard to the > time unit, and I don't think something more complicated belongs at the > bottom layer that is shared among all datetime uses. > > 'asfreq' converts from one unit to another (there's another function, > convert, that does not quite exactly the same thing, but I won't get into > details here). You'll probably have to take unit conversion into account if > you allow the .view() or .astype() methods on your np.datetime array... > It supports .astype(), with a truncation policy. This is motivated partially because that's how Pythons integer division works, and partially because if you consider a full datetime '2011-03-14T13:22:16', it's natural to think of the year as '2011', the date as '2011-03-14', etc, which is truncation. With regards to converting in the other direction, you can think of a datetime as representing a single moment in time, regardless of its unit of precision, and equate '2011' with '2011-01', etc. > In [80]: ts.Date('S', (_64.value + _65.value)//2) > > Out[80]: > > > > Adding dates definitely doesn't work, because datetimes have no zero, but > I would express it like this: > > Well, it can be argued that the epoch is 0... But in scikits.timeseries, > keep in mind that underneath, a DateArray is just an array of integer. 
> Yeah, that's the implementation, but letting the abstraction leak doesn't provide a real benefit I can see. [Dave] > > I really like the idea of being able to specify multiples of the base > frequency > > - e.g. [7D] is equivalenty to [W] not the least because it provides an > easy > > way to specify quarters [3M] or seasons [6M] which are important in my > work. > > NB: I also deal with half-hourly and quarter-hourly timeseries and I'm > sure > > there are many other example which are all made possible by allowing > > multipliers. > > Well, the experimental version kinda allowed that... > > > > > This is one of the things where I think mixing the datetime storage > precision with timeseries frequency seems counterproductive. Having > different origins for datetime64 starting on different weekdays near > 1970-01-01 doesn't seem like the right way to tackle the problem to me. I > see other valid reasons for reintroducing the origin metadata, but this one > I don't really like. > > We needed the concept to convert time series, for example from monthly to > quarterly (what is the first month of the year (as in succession of 12 > months) you want to start with ?) Does that need to be in the underlying datetime for layering a good timeseries implementation on top? -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Tue Jun 7 20:28:09 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Wed, 8 Jun 2011 02:28:09 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: > > It supports .astype(), with a truncation policy. This is motivated partially because that's how Pythons integer division works, and partially because if you consider a full datetime '2011-03-14T13:22:16', it's natural to think of the year as '2011', the date as '2011-03-14', etc, which is truncation. With regards to converting in the other direction, you can think of a datetime as representing a single moment in time, regardless of its unit of precision, and equate '2011' with '2011-01', etc. OK from high to low, less from low to high. That's where our keyword ('END' or "START") comes into play in scikits.timeseries, so that you can decide whether '2011' should be '2011-01' or '2011-06'... > We needed the concept to convert time series, for example from monthly to quarterly (what is the first month of the year (as in succession of 12 months) you want to start with ?) > > Does that need to be in the underlying datetime for layering a good timeseries implementation on top? Mmh. How would you define a quarter unit ? [3M] ? But then, what if you want your year to start in December, say (we often use DJF/MAM/JJA/SON as a way to decompose a year in four 'hydrological' seasons, for example) From amcmorl at gmail.com Tue Jun 7 20:53:24 2011 From: amcmorl at gmail.com (Angus McMorland) Date: Tue, 7 Jun 2011 20:53:24 -0400 Subject: [Numpy-discussion] git source datetime build error In-Reply-To: References: Message-ID: On 7 June 2011 17:57, Mark Wiebe wrote: > Hi Angus, > Thanks for reporting that, I've committed a fix so it builds in the > monolithic mode again. Thanks Mark. That fixed it for i686, but on x86_64 I get the following. Still looks datetime related. 
compile options: '-Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -Inumpy/core/include -Ibuild/src.linux-x86_64-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 -Ibuild/src.linux-x86_64-2.6/numpy/core/src/multiarray -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -c' gcc: numpy/core/src/umath/umathmodule_onefile.c In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:68: error: ?TIMEDELTA_ml_m_divide? undeclared here (not in a function) In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: error: ?TIMEDELTA_ml_m_floor_divide? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: error: (near initialization for ?floor_divide_functions[17]?) In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: ?TIMEDELTA_ml_m_multiply? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: (near initialization for ?multiply_functions[18]?) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: ?TIMEDELTA_lm_m_multiply? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: (near initialization for ?multiply_functions[19]?) In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: error: ?TIMEDELTA_ml_m_true_divide? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: error: (near initialization for ?true_divide_functions[17]?) In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:68: error: ?TIMEDELTA_ml_m_divide? undeclared here (not in a function) In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: error: ?TIMEDELTA_ml_m_floor_divide? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: error: (near initialization for ?floor_divide_functions[17]?) 
In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: ?TIMEDELTA_ml_m_multiply? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: (near initialization for ?multiply_functions[18]?) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: ?TIMEDELTA_lm_m_multiply? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: error: (near initialization for ?multiply_functions[19]?) In file included from numpy/core/src/umath/umathmodule.c.src:41, from numpy/core/src/umath/umathmodule_onefile.c:4: build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: error: ?TIMEDELTA_ml_m_true_divide? undeclared here (not in a function) build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: error: initializer element is not constant build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: error: (near initialization for ?true_divide_functions[17]?) error: Command "gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -Inumpy/core/include -Ibuild/src.linux-x86_64-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 -Ibuild/src.linux-x86_64-2.6/numpy/core/src/multiarray -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -c numpy/core/src/umath/umathmodule_onefile.c -o build/temp.linux-x86_64-2.6/numpy/core/src/umath/umathmodule_onefile.o" failed with exit status 1 -- AJC McMorland Post-doctoral research fellow Neurobiology, University of Pittsburgh From fperez.net at gmail.com Wed Jun 8 00:37:52 2011 From: fperez.net at gmail.com (Fernando Perez) Date: Tue, 7 Jun 2011 21:37:52 -0700 Subject: [Numpy-discussion] git source datetime build error In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 5:53 PM, Angus McMorland wrote: > Thanks Mark. That fixed it for i686, but on x86_64 I get the > following. Still looks datetime related. Same failure here... From fperez.net at gmail.com Wed Jun 8 00:52:23 2011 From: fperez.net at gmail.com (Fernando Perez) Date: Tue, 7 Jun 2011 21:52:23 -0700 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 4:35 PM, Mark Wiebe wrote: > I went ahead and did the merge today as I said I wanted to, that pull > request is some further development for someone to code-review if they have > time. I'm curious as to why there was a need to push ahead with the merge right away, without giving the original pull request more time for feedback? If I'm not mistaken, the big merge was this PR: https://github.com/numpy/numpy/pull/83 and it was just opened a few days ago, containing a massive amount of work, and so far had only received some feedback from charris, explicitly requesting a little more breakdown to make digesting it easier. 
I realize that I'm not really an active numpy contributor in any significant way, and I see that you've put a ton of work into this, including a very detailed and impressive discussion on the list on with multiple people. So my opinion is just that of a user, not really a core numpy developer. But it seems to me that part of having numpy be a better community-driven project is precisely achieved by having the patience to allow others to provide feedback and testing, even if they aren't 100% experts. And one thing that github really shines at, is making the review/feedback process about as painless as possible (I actually find it kind of fun). For example, with this merge, numpy HEAD right now won't even compile on x86_64, something that would easily have been caught with a bit more review, especially since it's so easy to test (even I can do that). It's been a long time since we had a situation where numpy didn't cleanly at least build from HEAD, so if nothing else, it's a (small) sign that this particular merge could have used a few more eyes... I realize that it's sometimes frustrating to have a lot of code sitting in review, and I know that certain efforts are large and self-contained enough that it's impractical to expect a detailed line-by-line review. We've had a few such monster branches in ipython in the past, but at least in those cases we've always tried to ensure several core people (over skype if needed) have a chance to go over the entire big picture, discuss the main details with the author so they can know what to focus on from the large merge, and run the tests in as many scenarios as is realistic. And so far we haven't really had any problems with this approach, even if it does require a little more patience in seeing that (often very high quality) work make it to the mainline. Regards, f From cournape at gmail.com Wed Jun 8 01:05:21 2011 From: cournape at gmail.com (David Cournapeau) Date: Wed, 8 Jun 2011 14:05:21 +0900 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: Message-ID: On Wed, Jun 8, 2011 at 1:52 PM, Fernando Perez wrote: > On Tue, Jun 7, 2011 at 4:35 PM, Mark Wiebe wrote: >> I went ahead and did the merge today as I said I wanted to, that pull >> request is some further development for someone to code-review if they have >> time. > > I'm curious as to why there was a need to push ahead with the merge > right away, without giving the original pull request more time for > feedback? ?If I'm not mistaken, the big merge was this PR: > > https://github.com/numpy/numpy/pull/83 > > and it was just opened a few days ago, containing a massive amount of > work, and so far had only received some feedback from charris, > explicitly requesting a little more breakdown to make digesting it > easier. > > I realize that I'm not really an active numpy contributor in any > significant way, and I see that you've put a ton of work into this, > including a very detailed and impressive discussion on the list on > with multiple people. ?So my opinion is just that of a user, not > really a core numpy developer. +1. There is no need to push this, and giving only a few hours for a review is not realistic (if only because of time differences around the globe). I would advise to revert that merge, and make sure the original branch got proper review. 
Giving a few days for a change involving > 10000 lines of code seems quite reasonable to me, cheers, David From Chris.Barker at noaa.gov Wed Jun 8 01:36:14 2011 From: Chris.Barker at noaa.gov (Chris Barker) Date: Tue, 07 Jun 2011 22:36:14 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: <4DEF0A4E.7030509@noaa.gov> On 6/7/11 4:53 PM, Pierre GM wrote: > Anyhow, each time yo > read 'frequency' in scikits.timeseries, think 'unit'. or maybe "precision" -- when I think if unit, I think of something that can be represented as a floating point value -- but here, with integers, it's the precision that can be represented. Just a thought. > Well, it can be argued that the epoch is 0... yes, but that really should be transparent to the user -- what epoch is chosen should influence as little as possible (e.g. only the range of values representable) > Mmh. How would you define a quarter unit ? [3M] ? But then, what if > you want your year to start in December, say (we often use > DJF/MAM/JJA/SON as a way to decompose a year in four 'hydrological' > seasons, for example) And the federal fiscal year is Oct - Sept, so the first quarter is (Oct, Nov, Dec) -- clearly that needs to be flexible. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From yoshi at rokuko.net Wed Jun 8 03:58:10 2011 From: yoshi at rokuko.net (Yoshi Rokuko) Date: Wed, 08 Jun 2011 09:58:10 +0200 Subject: [Numpy-discussion] numpy c api general array handling Message-ID: <201106080758.p587wAP7032347@lotus.yokuts.org> hey, i'm writing my first python module in c using numpy/arrayobject.h and i have some problems with different platforms (both linux but really different setup) so i suspect my array handling is not cor- rect. i'm used to c arrays and want to access large numpy arrays from within my c module with no strange array-iter-methods. So what are your remarks to the following: PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, PyArray_INT, 1, 1); ix = (int *)index->data; then using like: for(k = ix[i]; k < ix[i+1]; k++) l = ix[k]; this seems to work pretty well, but real problems come with: PyArg_ParseTuple(args, "O", &arg); adjc = (PyArrayObject *) PyArray_ContiguousFromObject(arg, PyArray_INT, 2, 2); a = (int *)adjc->data; and then using like: aik = a[i + n * k]; it seems like on one system it does not accept 2d numpy arrays just 1d ones or i must hand him a list of 1d numpy arrays like that: >>> A = np.array([[1,0],[0,1]]) >>> B = my.method(list(A)) i would prefer to avoid the list() call in python. what are your remarks on performance would it be faster to do: PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, PyArray_INT, 1, 1); ix = malloc(n * sizeof(int)); for(i = 0; i < n; i++) ix[i] = (int *)index->data[i]; and then use ix[] (i call a lot on ix). 
thank you and best regards, yoshi From wesmckinn at gmail.com Wed Jun 8 05:57:07 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 8 Jun 2011 11:57:07 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DEF0A4E.7030509@noaa.gov> References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: On Wed, Jun 8, 2011 at 7:36 AM, Chris Barker wrote: > On 6/7/11 4:53 PM, Pierre GM wrote: >> ?Anyhow, each time yo >> read 'frequency' in scikits.timeseries, think 'unit'. > > or maybe "precision" -- when I think if unit, I think of something that > can be represented as a floating point value -- but here, with integers, > it's the precision that can be represented. Just a thought. > >> Well, it can be argued that the epoch is 0... > > yes, but that really should be transparent to the user -- what epoch is > chosen should influence as little as possible (e.g. only the range of > values representable) > >> Mmh. How would you define a quarter unit ? [3M] ? But then, what if >> you want your year to start in December, say (we often use >> DJF/MAM/JJA/SON as a way to decompose a year in four 'hydrological' >> seasons, for example) > > And the federal fiscal year is Oct - Sept, so the first quarter is (Oct, > Nov, Dec) -- clearly that needs to be flexible. > > > -Chris > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R ? ? ? ? ? ?(206) 526-6959 ? voice > 7600 Sand Point Way NE ? (206) 526-6329 ? fax > Seattle, WA ?98115 ? ? ? (206) 526-6317 ? main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > Your guys' discussion is a bit overwhelming for me in my currently jet-lagged state ( =) ) but I thought I would comment on a couple things, especially now with the input of another financial Python user (great!). Note that I use scikits.timeseries very little for a few reasons (a bit OT, but...): - Fundamental need to be able to work with multiple time series, especially performing operations involving cross-sectional data - I think it's a bit hard for lay people to use (read: ex-MATLAB/R users). This is just my opinion, but a few years ago I thought about using it and concluded that teaching people how to properly use it (a precision tool, indeed!) was going to cause me grief. - The data alignment problem, best explained in code: In [8]: ts Out[8]: 2000-01-05 00:00:00 0.0503706684002 2000-01-12 00:00:00 -1.7660004939 2000-01-19 00:00:00 1.11716758554 2000-01-26 00:00:00 -0.171029995265 2000-02-02 00:00:00 -0.99876580126 2000-02-09 00:00:00 -0.262729046405 In [9]: ts.index Out[9]: offset: <1 Week: kwds={'weekday': 2}, weekday=2>, tzinfo: None [2000-01-05 00:00:00, ..., 2000-02-09 00:00:00] length: 6 In [10]: ts2 = ts[:4] In [11]: ts2.index Out[11]: offset: <1 Week: kwds={'weekday': 2}, weekday=2>, tzinfo: None [2000-01-05 00:00:00, ..., 2000-01-26 00:00:00] length: 4 In [12]: ts + ts2 Out[12]: 2000-01-05 00:00:00 0.1007413368 2000-01-12 00:00:00 -3.5320009878 2000-01-19 00:00:00 2.23433517109 2000-01-26 00:00:00 -0.34205999053 2000-02-02 00:00:00 NaN 2000-02-09 00:00:00 NaN Or ts / or ts2 could be completely DateRange-naive (e.g. they have no way of knowing that they are fixed-frequency), or even out of order, and stuff like this will work no problem. 
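To connect that to the datetime64 discussion: the label vector itself could just be a datetime64 array, and a rough hand-rolled sketch of the same NaN-filling alignment (not pandas internals, and assuming datetime64 arrays support comparisons and searchsorted) might look like:

>>> import numpy as np
>>> dates = np.datetime64('2000-01-05') + 7 * np.arange(6)   # the weekly labels above
>>> vals = np.random.randn(6)
>>> dates2, vals2 = dates[:4], vals[:4]
>>> out = np.empty(len(dates))
>>> out.fill(np.nan)
>>> pos = np.searchsorted(dates, dates2)   # where the shorter label set lines up
>>> out[pos] = vals[pos] + vals2           # summed where labels match, NaN elsewhere

pandas does all of that bookkeeping for you, which is exactly the point of the example above.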
I view the "fixed frequency" issue as sort of an afterthought-- if you need it, it's there for you (the DateRange class is a valid Index--"label vector"--for pandas objects, and provides an API for defining custom time deltas). Which leads me to: - Inability to derive custom offsets: I can do: In [14]: ts.shift(2, offset=2 * datetools.BDay()) Out[14]: 2000-01-11 00:00:00 0.0503706684002 2000-01-18 00:00:00 -1.7660004939 2000-01-25 00:00:00 1.11716758554 2000-02-01 00:00:00 -0.171029995265 2000-02-08 00:00:00 -0.99876580126 2000-02-15 00:00:00 -0.262729046405 or even generate, say, 5-minutely or 10-minutely date ranges thusly: In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', offset=datetools.Minute(5)) Out[16]: offset: <5 Minutes>, tzinfo: None [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] length: 85 I'm currently working on high perf reduceat-based resampling methods (e.g. converting secondly data to 5-minutely data). So in summary, w.r.t. time series data and datetime, the only things I care about from a datetime / pandas point of view: - Ability to easily define custom timedeltas - Generate datetime objects, or some equivalent, which can be used to back pandas data structures - (possible now??) Ability to have a set of frequency-naive dates (possibly not in order). This last point actually matters. Suppose you wanted to get the worst 5-performing days in the S&P 500 index: In [7]: spx.index Out[7]: offset: <1 BusinessDay>, tzinfo: None [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] length: 2963 # but this is OK In [8]: spx.order()[:5] Out[8]: 2008-10-15 00:00:00 -0.0903497960942 2008-12-01 00:00:00 -0.0892952780505 2008-09-29 00:00:00 -0.0878970494885 2008-10-09 00:00:00 -0.0761670761671 2008-11-20 00:00:00 -0.0671229140321 - W From wesmckinn at gmail.com Wed Jun 8 06:05:49 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 8 Jun 2011 12:05:49 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: On Wed, Jun 8, 2011 at 11:57 AM, Wes McKinney wrote: > On Wed, Jun 8, 2011 at 7:36 AM, Chris Barker wrote: >> On 6/7/11 4:53 PM, Pierre GM wrote: >>> ?Anyhow, each time yo >>> read 'frequency' in scikits.timeseries, think 'unit'. >> >> or maybe "precision" -- when I think if unit, I think of something that >> can be represented as a floating point value -- but here, with integers, >> it's the precision that can be represented. Just a thought. >> >>> Well, it can be argued that the epoch is 0... >> >> yes, but that really should be transparent to the user -- what epoch is >> chosen should influence as little as possible (e.g. only the range of >> values representable) >> >>> Mmh. How would you define a quarter unit ? [3M] ? But then, what if >>> you want your year to start in December, say (we often use >>> DJF/MAM/JJA/SON as a way to decompose a year in four 'hydrological' >>> seasons, for example) >> >> And the federal fiscal year is Oct - Sept, so the first quarter is (Oct, >> Nov, Dec) -- clearly that needs to be flexible. >> >> >> -Chris >> >> >> >> >> -- >> Christopher Barker, Ph.D. >> Oceanographer >> >> Emergency Response Division >> NOAA/NOS/OR&R ? ? ? ? ? ?(206) 526-6959 ? voice >> 7600 Sand Point Way NE ? (206) 526-6329 ? fax >> Seattle, WA ?98115 ? ? ? (206) 526-6317 ? 
main reception >> >> Chris.Barker at noaa.gov >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > Your guys' discussion is a bit overwhelming for me in my currently > jet-lagged state ( =) ) but I thought I would comment on a couple > things, especially now with the input of another financial Python user > (great!). > > Note that I use scikits.timeseries very little for a few reasons (a > bit OT, but...): > > - Fundamental need to be able to work with multiple time series, > especially performing operations involving cross-sectional data > - I think it's a bit hard for lay people to use (read: ex-MATLAB/R > users). This is just my opinion, but a few years ago I thought about > using it and concluded that teaching people how to properly use it (a > precision tool, indeed!) was going to cause me grief. > - The data alignment problem, best explained in code: > > In [8]: ts > Out[8]: > 2000-01-05 00:00:00 ? ?0.0503706684002 > 2000-01-12 00:00:00 ? ?-1.7660004939 > 2000-01-19 00:00:00 ? ?1.11716758554 > 2000-01-26 00:00:00 ? ?-0.171029995265 > 2000-02-02 00:00:00 ? ?-0.99876580126 > 2000-02-09 00:00:00 ? ?-0.262729046405 > > In [9]: ts.index > Out[9]: > > offset: <1 Week: kwds={'weekday': 2}, weekday=2>, tzinfo: None > [2000-01-05 00:00:00, ..., 2000-02-09 00:00:00] > length: 6 > > In [10]: ts2 = ts[:4] > > In [11]: ts2.index > Out[11]: > > offset: <1 Week: kwds={'weekday': 2}, weekday=2>, tzinfo: None > [2000-01-05 00:00:00, ..., 2000-01-26 00:00:00] > length: 4 > > In [12]: ts + ts2 > Out[12]: > 2000-01-05 00:00:00 ? ?0.1007413368 > 2000-01-12 00:00:00 ? ?-3.5320009878 > 2000-01-19 00:00:00 ? ?2.23433517109 > 2000-01-26 00:00:00 ? ?-0.34205999053 > 2000-02-02 00:00:00 ? ?NaN > 2000-02-09 00:00:00 ? ?NaN > > Or ts / or ts2 could be completely DateRange-naive (e.g. they have no > way of knowing that they are fixed-frequency), or even out of order, > and stuff like this will work no problem. I view the "fixed frequency" > issue as sort of an afterthought-- if you need it, it's there for you > (the DateRange class is a valid Index--"label vector"--for pandas > objects, and provides an API for defining custom time deltas). Which > leads me to: > > - Inability to derive custom offsets: > > I can do: > > In [14]: ts.shift(2, offset=2 * datetools.BDay()) > Out[14]: > 2000-01-11 00:00:00 ? ?0.0503706684002 > 2000-01-18 00:00:00 ? ?-1.7660004939 > 2000-01-25 00:00:00 ? ?1.11716758554 > 2000-02-01 00:00:00 ? ?-0.171029995265 > 2000-02-08 00:00:00 ? ?-0.99876580126 > 2000-02-15 00:00:00 ? ?-0.262729046405 > > or even generate, say, 5-minutely or 10-minutely date ranges thusly: > > In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', > offset=datetools.Minute(5)) > Out[16]: > > offset: <5 Minutes>, tzinfo: None > [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] > length: 85 > > I'm currently working on high perf reduceat-based resampling methods > (e.g. converting secondly data to 5-minutely data). > > So in summary, w.r.t. time series data and datetime, the only things I > care about from a datetime / pandas point of view: > > - Ability to easily define custom timedeltas > - Generate datetime objects, or some equivalent, which can be used to > back pandas data structures > - (possible now??) Ability to have a set of frequency-naive dates > (possibly not in order). > > This last point actually matters. 
Suppose you wanted to get the worst > 5-performing days in the S&P 500 index: > > In [7]: spx.index > Out[7]: > > offset: <1 BusinessDay>, tzinfo: None > [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] > length: 2963 > > # but this is OK > In [8]: spx.order()[:5] > Out[8]: > 2008-10-15 00:00:00 ? ?-0.0903497960942 > 2008-12-01 00:00:00 ? ?-0.0892952780505 > 2008-09-29 00:00:00 ? ?-0.0878970494885 > 2008-10-09 00:00:00 ? ?-0.0761670761671 > 2008-11-20 00:00:00 ? ?-0.0671229140321 > > - W > I should add that if datetime64 gets me 80% to solving my needs (which are rather domain-specific), I will be very happy. Reducing the memory footprint of long time series (versus having millions of datetime.datetime objects lying around) will also be a big benefit. From dave.hirschfeld at gmail.com Wed Jun 8 06:37:38 2011 From: dave.hirschfeld at gmail.com (Dave Hirschfeld) Date: Wed, 8 Jun 2011 10:37:38 +0000 (UTC) Subject: [Numpy-discussion] fixing up datetime References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: Wes McKinney gmail.com> writes: > > > > > - Fundamental need to be able to work with multiple time series, > > especially performing operations involving cross-sectional data > > - I think it's a bit hard for lay people to use (read: ex-MATLAB/R > > users). This is just my opinion, but a few years ago I thought about > > using it and concluded that teaching people how to properly use it (a > > precision tool, indeed!) was going to cause me grief. > > - The data alignment problem, best explained in code: > > > > > > - Inability to derive custom offsets: > > > > I can do: > > > > In [14]: ts.shift(2, offset=2 * datetools.BDay()) > > Out[14]: > > 2000-01-11 00:00:00 ? ?0.0503706684002 > > 2000-01-18 00:00:00 ? ?-1.7660004939 > > 2000-01-25 00:00:00 ? ?1.11716758554 > > 2000-02-01 00:00:00 ? ?-0.171029995265 > > 2000-02-08 00:00:00 ? ?-0.99876580126 > > 2000-02-15 00:00:00 ? ?-0.262729046405 > > > > or even generate, say, 5-minutely or 10-minutely date ranges thusly: > > > > In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', > > offset=datetools.Minute(5)) > > Out[16]: > > > > offset: <5 Minutes>, tzinfo: None > > [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] > > length: 85 > > It would be nice to have a step argument in the date_array function. The following works though: In [96]: integers = r_[ts.Date('T',"01-Aug-2011 05:00").value: ts.Date('T',"06-Aug-2011 12:01").value: 5] In [97]: ts.date_array(integers, freq='T') Out[97]: DateArray([01-Aug-2011 05:00, 01-Aug-2011 05:05, 01-Aug-2011 05:10, ..., 06-Aug-2011 11:45, 06-Aug-2011 11:50, 06-Aug-2011 11:55, ..., 06-Aug-2011 12:00], freq='T') > > - (possible now??) Ability to have a set of frequency-naive dates > > (possibly not in order). > > > > This last point actually matters. Suppose you wanted to get the worst > > 5-performing days in the S&P 500 index: > > > > In [7]: spx.index > > Out[7]: > > > > offset: <1 BusinessDay>, tzinfo: None > > [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] > > length: 2963 > > > > # but this is OK > > In [8]: spx.order()[:5] > > Out[8]: > > 2008-10-15 00:00:00 ? ?-0.0903497960942 > > 2008-12-01 00:00:00 ? ?-0.0892952780505 > > 2008-09-29 00:00:00 ? ?-0.0878970494885 > > 2008-10-09 00:00:00 ? ?-0.0761670761671 > > 2008-11-20 00:00:00 ? 
?-0.0671229140321 > > > > - W > > Seems like you're looking for the tssort function: In [90]: series = ts.time_series(randn(365),start_date='01-Jan-2011',freq='D') In [91]: def tssort(series): ....: indices = series.argsort() ....: return ts.time_series(series._series[indices], ....: series.dates[indices], ....: autosort=False) ....: In [92]: tssort(series)[0:5] Out[92]: timeseries([-2.96602612 -2.81612524 -2.61558511 -2.59522921 -2.4899447 ], dates = [26-Aug-2011 18-Apr-2011 27-Aug-2011 21-Aug-2011 19-Nov-2011], freq = D) Pandas seems to offer an even higher level api than scikits.timeseries. I view them as mostly complementary but I haven't (yet) had much experience with pandas... -Dave From wesmckinn at gmail.com Wed Jun 8 06:46:23 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 8 Jun 2011 06:46:23 -0400 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: On Wed, Jun 8, 2011 at 6:37 AM, Dave Hirschfeld wrote: > Wes McKinney gmail.com> writes: > >> >> > >> > - Fundamental need to be able to work with multiple time series, >> > especially performing operations involving cross-sectional data >> > - I think it's a bit hard for lay people to use (read: ex-MATLAB/R >> > users). This is just my opinion, but a few years ago I thought about >> > using it and concluded that teaching people how to properly use it (a >> > precision tool, indeed!) was going to cause me grief. >> > - The data alignment problem, best explained in code: >> > > >> > >> > - Inability to derive custom offsets: >> > >> > I can do: >> > >> > In [14]: ts.shift(2, offset=2 * datetools.BDay()) >> > Out[14]: >> > 2000-01-11 00:00:00 ? ?0.0503706684002 >> > 2000-01-18 00:00:00 ? ?-1.7660004939 >> > 2000-01-25 00:00:00 ? ?1.11716758554 >> > 2000-02-01 00:00:00 ? ?-0.171029995265 >> > 2000-02-08 00:00:00 ? ?-0.99876580126 >> > 2000-02-15 00:00:00 ? ?-0.262729046405 >> > >> > or even generate, say, 5-minutely or 10-minutely date ranges thusly: >> > >> > In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', >> > offset=datetools.Minute(5)) >> > Out[16]: >> > >> > offset: <5 Minutes>, tzinfo: None >> > [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] >> > length: 85 >> > > > It would be nice to have a step argument in the date_array function. The > following works though: > > In [96]: integers = r_[ts.Date('T',"01-Aug-2011 05:00").value: > ? ? ? ? ? ? ? ? ? ? ? ts.Date('T',"06-Aug-2011 12:01").value: > ? ? ? ? ? ? ? ? ? ? ? 5] > > In [97]: ts.date_array(integers, freq='T') > Out[97]: > DateArray([01-Aug-2011 05:00, 01-Aug-2011 05:05, 01-Aug-2011 05:10, ..., > ? ? ? 06-Aug-2011 11:45, 06-Aug-2011 11:50, 06-Aug-2011 11:55, ..., > ? ? ? 06-Aug-2011 12:00], > ? ? ? ? ?freq='T') > > >> > - (possible now??) Ability to have a set of frequency-naive dates >> > (possibly not in order). >> > >> > This last point actually matters. Suppose you wanted to get the worst >> > 5-performing days in the S&P 500 index: >> > >> > In [7]: spx.index >> > Out[7]: >> > >> > offset: <1 BusinessDay>, tzinfo: None >> > [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] >> > length: 2963 >> > >> > # but this is OK >> > In [8]: spx.order()[:5] >> > Out[8]: >> > 2008-10-15 00:00:00 ? ?-0.0903497960942 >> > 2008-12-01 00:00:00 ? ?-0.0892952780505 >> > 2008-09-29 00:00:00 ? ?-0.0878970494885 >> > 2008-10-09 00:00:00 ? ?-0.0761670761671 >> > 2008-11-20 00:00:00 ? 
?-0.0671229140321 >> > >> > - W >> > > > Seems like you're looking for the tssort function: > > In [90]: series = ts.time_series(randn(365),start_date='01-Jan-2011',freq='D') > > In [91]: def tssort(series): > ? ....: ? ? indices = series.argsort() > ? ....: ? ? return ts.time_series(series._series[indices], > ? ....: ? ? ? ? ? ? ? ? ? ? ? ? ? series.dates[indices], > ? ....: ? ? ? ? ? ? ? ? ? ? ? ? ? autosort=False) > ? ....: > > In [92]: tssort(series)[0:5] > Out[92]: > timeseries([-2.96602612 -2.81612524 -2.61558511 -2.59522921 -2.4899447 ], > ? dates = [26-Aug-2011 18-Apr-2011 27-Aug-2011 21-Aug-2011 19-Nov-2011], > ? freq ?= D) > > > Pandas seems to offer an even higher level api than scikits.timeseries. I view > them as mostly complementary but I haven't (yet) had much experience with > pandas... > > -Dave > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > I admit that my partial ignorance about scikits.timeseries (due to seldom use) makes my statements a bit unfair-- but you're right in that it's a high-level versus low-level thing. Just about everything is possible but may require a certain level of experience / sophistication / deep understanding (as you have). From dave.hirschfeld at gmail.com Wed Jun 8 06:59:48 2011 From: dave.hirschfeld at gmail.com (Dave Hirschfeld) Date: Wed, 8 Jun 2011 10:59:48 +0000 (UTC) Subject: [Numpy-discussion] fixing up datetime References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: Mark Wiebe gmail.com> writes: > > > It appears to me that a structured dtype with some further NumPy extensions > could entirely replace the 'events' metadata fairly cleanly. If the ufuncs > are extended to operate on structured arrays, and integers modulo n are > added as a new dtype, a dtype like > [('date', 'M8[D]'), ('event', 'i8[mod 100]')] could replace the current > 'M8[D]//100'. Sounds like a cleaner API. > >> >> As Dave H. summarized, we used a basic keyword to do the same thing in >> scikits.timeseries, with the addition of some subfrequencies like A-SEP >> to represent a year starting in September, for example. It works, but it's >> really not elegant a solution. >> > > This kind of thing definitely belongs in a layer above datetime. > That's fair enough - my perspective as a timeseries user is probably a lot higher level. My main aim is to point out some of the higher level uses so that the numpy dtype can be made compatible with them - I'd hate to have a situation where we have multiple different datetime representations in the different libraries and having to continually convert at the boundaries. That said, I think the starting point for a series at a weekly, quarterly or annual frequency/unit/type is something which may need to be sorted out at the lowest level... > > One overall impression I have about timeseries in general is the use of the > term "frequency" synonymously with the time unit. To me, a frequency is a > numerical quantity with a unit of 1/(time unit), so while it's related to > the time unit, naming it the same is something the specific timeseries > domain has chosen to do, I think the numpy datetime class shouldn't have > anything called "frequency" in it, and I would like to remove the current > usage of that terminology from the codebase. 
> It seems that it's just a naming convention (possibly not the best) and can be used synonymously with the "time unit"/resolution/dtype > I don't envision 'asfreq' being a datetime function, this is the kind > of thing that would layer on top in a specialized timeseries library. The > behavior of timedelta follows a more physics-like idea with regard to the > time unit, and I don't think something more complicated belongs at the bottom > layer that is shared among all datetime uses. I think since freq <==> dtype then asfreq <==> astype. From your examples it seems to do the same thing - i.e. if you go to a lower resolution (freq) representation the higher resolution information is truncated - e.g. a monthly resolution date has no information about days/hours/minutes/seconds etc. It's converting in the other direction: low --> high resolution where the difference lies - numpy always converts to the start of the interval whereas the timeseries Date class gives you the option of the start or the end. > I'm thinking of a datetime as an infinitesimal moment in time, with the > unit representing the precision that can be stored. Thus, '2011', > '2011-01', and '2011-01-01T00:00:00.00Z' all represent the same moment in > time, but with a different unit of precision. Computationally, this > perspective is turning out to provide a pretty rich mechanism to do > operations on date. I think this is where the main point of difference is. I use the timeseries as a container for data which arrives in a certain interval of time. e.g. monthly temperatures are the average temperature over the interval defined by the particular month. It's not the temperature at the instant of time that the month began, or ended or of any particular instant in that month. Thus the conversion from a monthly resolution to say a daily resolution isn't well defined and the right thing to do is likely to be application specific. For the average temperature example you may want to choose a value in the middle of the month so that you don't introduce a phase delay if you interpolate between the datapoints. If for example you measured total rainfall in a month you might want to choose the last day of the month to represent the total rainfall for that month as that was the only date in the interval where the total rainfall did in fact equal the monthly value. It may be as you say though that all this functionality does belong at a higher level... Regards, Dave From mwwiebe at gmail.com Wed Jun 8 11:26:16 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 10:26:16 -0500 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 11:52 PM, Fernando Perez wrote: > On Tue, Jun 7, 2011 at 4:35 PM, Mark Wiebe wrote: > > I went ahead and did the merge today as I said I wanted to, that pull > > request is some further development for someone to code-review if they > have > > time. > > I'm curious as to why there was a need to push ahead with the merge > right away, without giving the original pull request more time for > feedback? If I'm not mistaken, the big merge was this PR: > > https://github.com/numpy/numpy/pull/83 > > and it was just opened a few days ago, containing a massive amount of > work, and so far had only received some feedback from charris, > explicitly requesting a little more breakdown to make digesting it > easier. > This is all true, I'll try to explain what my thought process was in doing the merge. 
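To make Dave's freq <==> dtype, asfreq <==> astype analogy above concrete,
here is a minimal sketch assuming the conversion behaviour currently in the
development branch:

import numpy as np

m = np.datetime64('2011-06', 'M')                     # month resolution
d = m.astype('M8[D]')                                 # low -> high: snaps to 2011-06-01, the start of the interval
back = np.datetime64('2011-06-17').astype('M8[M]')    # high -> low: the day information is simply dropped

An end-of-interval convention, like the one the timeseries Date class
offers, would have to be layered on top.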
This set of changes basically takes the codebase from a basically unusable datetime to a good starting point for all the design discussions that we've been having on the mailing list. I opened the pull request with about half of the changes that were there to try and get some feedback, and kept developing on a datetime-fixes2 branch. When I reached the point that I later merged in, nobody had responded so I added the new commits to the same pull request. Perhaps I should have more patience with git history-editing and revisiting the development history, but I've basically come to the conclusion that it's not worth the effort except on relatively minor things, so I want to rather focus on the code and its design instead of the particular series of commits that produced it. In doing those commits, I had to repeatedly double back and implement new missing functionality before returning to and finishing up what I was working on. For the development I'm doing now, which is related to the multitude of design discussions, I'm splitting it up into more topical branches partially because the work I merged provides a solid foundation for doing so, and because these are things diverging from the NEP instead of being things I perceived as having already had a discussion process during the NEP formation. I tried to see if github would let me do a "dependent" pull request, but it just included the commits from the branch my later development was sitting on, and that's probably the main reason I wanted to do a post-commit style review for this set of changes instead of pre-commit. I wrote this email to try and communicate my transition from pre-commit to post-commit review, but I think my wording about this probably wasn't clear. I realize that I'm not really an active numpy contributor in any > significant way, and I see that you've put a ton of work into this, > including a very detailed and impressive discussion on the list on > with multiple people. So my opinion is just that of a user, not > really a core numpy developer. > I think your opinion is much more than that, particularly since you're actively working on closely related projects using the same community infrastructure. But it seems to me that part of having numpy be a better > community-driven project is precisely achieved by having the patience > to allow others to provide feedback and testing, even if they aren't > 100% experts. And one thing that github really shines at, is making > the review/feedback process about as painless as possible (I actually > find it kind of fun). > Definitely true (except for not having dependent pull requests, unless my search was too shallow...). That pull request also has nothing to do with the discussion we're currently having, it's more of a prerequisite, so anyone who is following the discussion and wants to dig in and review the code I'm doing related to that discussion will be lost in a swamp. By merging this prerequisite, and introducing separated, cleaner pull requests on that base that are directly from issues being discussed, this kind of community collaboration is much more likely to happen. I've simply done a bad job of communicating this, and as I'm doing the things we're discussing I'll try and tie these different elements better to encourage the ideals you're describing. For example, with this merge, numpy HEAD right now won't even compile > on x86_64, something that would easily have been caught with a bit > more review, especially since it's so easy to test (even I can do > that). 
It's been a long time since we had a situation where numpy > didn't cleanly at least build from HEAD, so if nothing else, it's a > (small) sign that this particular merge could have used a few more > eyes... > I apologize for that, I've grown accustomed to having little to no review to my pull requests, except from Chuck whose time and effort I've greatly appreciated, and has significantly improved the contributions I've made. The only active C-level NumPy development currently appears to be what I'm doing, and the great clean-up/extension proposals that Chuck has emailed about. I would like it if the buildbot system worked better to let me automatically trigger some build/tests on a variety of platforms before merging a branch, but it is as it is. I realize that it's sometimes frustrating to have a lot of code > sitting in review, and I know that certain efforts are large and > self-contained enough that it's impractical to expect a detailed > line-by-line review. We've had a few such monster branches in ipython > in the past, but at least in those cases we've always tried to ensure > several core people (over skype if needed) have a chance to go over > the entire big picture, discuss the main details with the author so > they can know what to focus on from the large merge, and run the tests > in as many scenarios as is realistic. And so far we haven't really > had any problems with this approach, even if it does require a little > more patience in seeing that (often very high quality) work make it to > the mainline. > That approach sounds great, and NumPy needs more active core developers! Cheers, Mark > > Regards, > > f > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 8 11:27:15 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 10:27:15 -0500 Subject: [Numpy-discussion] git source datetime build error In-Reply-To: References: Message-ID: I've committed a fix for x86_64 as well now. Sorry for the breakage! -Mark On Tue, Jun 7, 2011 at 7:53 PM, Angus McMorland wrote: > On 7 June 2011 17:57, Mark Wiebe wrote: > > Hi Angus, > > Thanks for reporting that, I've committed a fix so it builds in the > > monolithic mode again. > > Thanks Mark. That fixed it for i686, but on x86_64 I get the > following. Still looks datetime related. > > compile options: '-Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath > -Inumpy/core/include > -Ibuild/src.linux-x86_64-2.6/numpy/core/include/numpy > -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core > -Inumpy/core/src/npymath -Inumpy/core/src/multiarray > -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 > -Ibuild/src.linux-x86_64-2.6/numpy/core/src/multiarray > -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -c' > gcc: numpy/core/src/umath/umathmodule_onefile.c > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:68: > error: ?TIMEDELTA_ml_m_divide? undeclared here (not in a function) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: > error: ?TIMEDELTA_ml_m_floor_divide? 
undeclared here (not in a > function) > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: > error: initializer element is not constant > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: > error: (near initialization for ?floor_divide_functions[17]?) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: ?TIMEDELTA_ml_m_multiply? undeclared here (not in a function) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: initializer element is not constant > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: (near initialization for ?multiply_functions[18]?) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: ?TIMEDELTA_lm_m_multiply? undeclared here (not in a function) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: initializer element is not constant > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: (near initialization for ?multiply_functions[19]?) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: > error: ?TIMEDELTA_ml_m_true_divide? undeclared here (not in a > function) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: > error: initializer element is not constant > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: > error: (near initialization for ?true_divide_functions[17]?) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:68: > error: ?TIMEDELTA_ml_m_divide? undeclared here (not in a function) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: > error: ?TIMEDELTA_ml_m_floor_divide? undeclared here (not in a > function) > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: > error: initializer element is not constant > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:89: > error: (near initialization for ?floor_divide_functions[17]?) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: ?TIMEDELTA_ml_m_multiply? undeclared here (not in a function) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: initializer element is not constant > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: (near initialization for ?multiply_functions[18]?) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: ?TIMEDELTA_lm_m_multiply? 
undeclared here (not in a function) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: initializer element is not constant > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:170: > error: (near initialization for ?multiply_functions[19]?) > In file included from numpy/core/src/umath/umathmodule.c.src:41, > from numpy/core/src/umath/umathmodule_onefile.c:4: > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: > error: ?TIMEDELTA_ml_m_true_divide? undeclared here (not in a > function) > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: > error: initializer element is not constant > > build/src.linux-x86_64-2.6/numpy/core/include/numpy/__umath_generated.c:236: > error: (near initialization for ?true_divide_functions[17]?) > error: Command "gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv > -O2 -Wall -Wstrict-prototypes -fPIC > -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -Inumpy/core/include > -Ibuild/src.linux-x86_64-2.6/numpy/core/include/numpy > -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core > -Inumpy/core/src/npymath -Inumpy/core/src/multiarray > -Inumpy/core/src/umath -Inumpy/core/include -I/usr/include/python2.6 > -Ibuild/src.linux-x86_64-2.6/numpy/core/src/multiarray > -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -c > numpy/core/src/umath/umathmodule_onefile.c -o > build/temp.linux-x86_64-2.6/numpy/core/src/umath/umathmodule_onefile.o" > failed with exit status 1 > -- > AJC McMorland > Post-doctoral research fellow > Neurobiology, University of Pittsburgh > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 8 11:49:30 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 10:49:30 -0500 Subject: [Numpy-discussion] numpy c api general array handling In-Reply-To: <201106080758.p587wAP7032347@lotus.yokuts.org> References: <201106080758.p587wAP7032347@lotus.yokuts.org> Message-ID: I believe what you may want is PyArray_ContiguousFromAny instead of PyArray_ContiguousFromObject. Cheers, Mark On Wed, Jun 8, 2011 at 2:58 AM, Yoshi Rokuko wrote: > hey, > > i'm writing my first python module in c using numpy/arrayobject.h > and i have some problems with different platforms (both linux but > really different setup) so i suspect my array handling is not cor- > rect. > > i'm used to c arrays and want to access large numpy arrays from > within my c module with no strange array-iter-methods. So what are > your remarks to the following: > > PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); > index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, > PyArray_INT, 1, 1); > ix = (int *)index->data; > > then using like: > > for(k = ix[i]; k < ix[i+1]; k++) l = ix[k]; > > this seems to work pretty well, but real problems come with: > PyArg_ParseTuple(args, "O", &arg); > adjc = (PyArrayObject *) PyArray_ContiguousFromObject(arg, > PyArray_INT, 2, 2); > a = (int *)adjc->data; > > and then using like: > > aik = a[i + n * k]; > > it seems like on one system it does not accept 2d numpy arrays just > 1d ones or i must hand him a list of 1d numpy arrays like that: > > >>> A = np.array([[1,0],[0,1]]) > >>> B = my.method(list(A)) > > i would prefer to avoid the list() call in python. 
> > what are your remarks on performance would it be faster to do: > > PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); > index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, > PyArray_INT, 1, 1); > ix = malloc(n * sizeof(int)); > for(i = 0; i < n; i++) > ix[i] = (int *)index->data[i]; > > and then use ix[] (i call a lot on ix). > > thank you and best regards, yoshi > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scott.sinclair.za at gmail.com Wed Jun 8 12:03:38 2011 From: scott.sinclair.za at gmail.com (Scott Sinclair) Date: Wed, 8 Jun 2011 18:03:38 +0200 Subject: [Numpy-discussion] git source datetime build error In-Reply-To: References: Message-ID: On 8 June 2011 17:27, Mark Wiebe wrote: > I've committed a fix for x86_64 as well now. Sorry for the breakage! Works for me. (numpy-master-2.7)scott at godzilla:~$ python -c "import numpy; numpy.test()" Running unit tests for numpy NumPy version 2.0.0.dev-76ca55f NumPy is installed in /home/scott/.virtualenvs/numpy-master-2.7/local/lib/python2.7/site-packages/numpy Python version 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) [GCC 4.5.2] nose version 0.11.4 ........... ---------------------------------------------------------------------- Ran 3212 tests in 14.816s OK (KNOWNFAIL=3, SKIP=6) Cheers, Scott From Chris.Barker at noaa.gov Wed Jun 8 12:57:01 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Wed, 08 Jun 2011 09:57:01 -0700 Subject: [Numpy-discussion] numpy c api general array handling In-Reply-To: References: <201106080758.p587wAP7032347@lotus.yokuts.org> Message-ID: <4DEFA9DD.6000608@noaa.gov> Mark Wiebe wrote: > I believe what you may want is PyArray_ContiguousFromAny instead of > PyArray_ContiguousFromObject. I would also strongly encourage you to take a look at Cython: http://cython.org/ It has built-in support for numpy arrays, so it can take care of a lot of this bookkeeping for you. http://wiki.cython.org/tutorials/numpy -Chris > Cheers, > Mark > > On Wed, Jun 8, 2011 at 2:58 AM, Yoshi Rokuko > wrote: > > hey, > > i'm writing my first python module in c using numpy/arrayobject.h > and i have some problems with different platforms (both linux but > really different setup) so i suspect my array handling is not cor- > rect. > > i'm used to c arrays and want to access large numpy arrays from > within my c module with no strange array-iter-methods. So what are > your remarks to the following: > > PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); > index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, > PyArray_INT, 1, 1); > ix = (int *)index->data; > > then using like: > > for(k = ix[i]; k < ix[i+1]; k++) l = ix[k]; > > this seems to work pretty well, but real problems come with: > PyArg_ParseTuple(args, "O", &arg); > adjc = (PyArrayObject *) PyArray_ContiguousFromObject(arg, > PyArray_INT, 2, 2); > a = (int *)adjc->data; > > and then using like: > > aik = a[i + n * k]; > > it seems like on one system it does not accept 2d numpy arrays just > 1d ones or i must hand him a list of 1d numpy arrays like that: > > >>> A = np.array([[1,0],[0,1]]) > >>> B = my.method(list(A)) > > i would prefer to avoid the list() call in python. 
> > what are your remarks on performance would it be faster to do: > > PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); > index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, > PyArray_INT, 1, 1); > ix = malloc(n * sizeof(int)); > for(i = 0; i < n; i++) > ix[i] = (int *)index->data[i]; > > and then use ix[] (i call a lot on ix). > > thank you and best regards, yoshi > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > ------------------------------------------------------------------------ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From bsouthey at gmail.com Wed Jun 8 14:04:03 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 08 Jun 2011 13:04:03 -0500 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: Message-ID: <4DEFB993.9050306@gmail.com> On 06/08/2011 10:26 AM, Mark Wiebe wrote: > On Tue, Jun 7, 2011 at 11:52 PM, Fernando Perez @gmail.com > wrote: > > On Tue, Jun 7, 2011 at 4:35 PM, Mark Wiebe > wrote: > > I went ahead and did the merge today as I said I wanted to, that > pull > > request is some further development for someone to code-review > if they have > > time. > > I'm curious as to why there was a need to push ahead with the merge > right away, without giving the original pull request more time for > feedback? If I'm not mistaken, the big merge was this PR: > > https://github.com/numpy/numpy/pull/83 > > and it was just opened a few days ago, containing a massive amount of > work, and so far had only received some feedback from charris, > explicitly requesting a little more breakdown to make digesting it > easier. > > > This is all true, I'll try to explain what my thought process was in > doing the merge. This set of changes basically takes the codebase from > a basically unusable datetime to a good starting point for all the > design discussions that we've been having on the mailing list. I > opened the pull request with about half of the changes that were there > to try and get some feedback, and kept developing on a datetime-fixes2 > branch. When I reached the point that I later merged in, nobody had > responded so I added the new commits to the same pull request. > > Perhaps I should have more patience with git history-editing and > revisiting the development history, but I've basically come to the > conclusion that it's not worth the effort except on relatively minor > things, so I want to rather focus on the code and its design instead > of the particular series of commits that produced it. In doing those > commits, I had to repeatedly double back and implement new missing > functionality before returning to and finishing up what I was working on. > > For the development I'm doing now, which is related to the multitude > of design discussions, I'm splitting it up into more topical branches > partially because the work I merged provides a solid foundation for > doing so, and because these are things diverging from the NEP instead > of being things I perceived as having already had a discussion process > during the NEP formation. 
> > I tried to see if github would let me do a "dependent" pull request, > but it just included the commits from the branch my later development > was sitting on, and that's probably the main reason I wanted to do a > post-commit style review for this set of changes instead of > pre-commit. I wrote this email to try and communicate my transition > from pre-commit to post-commit review, but I think my wording about > this probably wasn't clear. > > I realize that I'm not really an active numpy contributor in any > significant way, and I see that you've put a ton of work into this, > including a very detailed and impressive discussion on the list on > with multiple people. So my opinion is just that of a user, not > really a core numpy developer. > > > I think your opinion is much more than that, particularly since you're > actively working on closely related projects using the same community > infrastructure. > > But it seems to me that part of having numpy be a better > community-driven project is precisely achieved by having the patience > to allow others to provide feedback and testing, even if they aren't > 100% experts. And one thing that github really shines at, is making > the review/feedback process about as painless as possible (I actually > find it kind of fun). > > > Definitely true (except for not having dependent pull requests, unless > my search was too shallow...). That pull request also has nothing to > do with the discussion we're currently having, it's more of a > prerequisite, so anyone who is following the discussion and wants to > dig in and review the code I'm doing related to that discussion will > be lost in a swamp. By merging this prerequisite, and introducing > separated, cleaner pull requests on that base that are directly from > issues being discussed, this kind of community collaboration is much > more likely to happen. I've simply done a bad job of communicating > this, and as I'm doing the things we're discussing I'll try and tie > these different elements better to encourage the ideals you're describing. > > For example, with this merge, numpy HEAD right now won't even compile > on x86_64, something that would easily have been caught with a bit > more review, especially since it's so easy to test (even I can do > that). It's been a long time since we had a situation where numpy > didn't cleanly at least build from HEAD, so if nothing else, it's a > (small) sign that this particular merge could have used a few more > eyes... > > > I apologize for that, I've grown accustomed to having little to no > review to my pull requests, except from Chuck whose time and effort > I've greatly appreciated, and has significantly improved the > contributions I've made. The only active C-level NumPy development > currently appears to be what I'm doing, and the great > clean-up/extension proposals that Chuck has emailed about. I would > like it if the buildbot system worked better to let me automatically > trigger some build/tests on a variety of platforms before merging a > branch, but it is as it is. > > I realize that it's sometimes frustrating to have a lot of code > sitting in review, and I know that certain efforts are large and > self-contained enough that it's impractical to expect a detailed > line-by-line review. 
We've had a few such monster branches in ipython > in the past, but at least in those cases we've always tried to ensure > several core people (over skype if needed) have a chance to go over > the entire big picture, discuss the main details with the author so > they can know what to focus on from the large merge, and run the tests > in as many scenarios as is realistic. And so far we haven't really > had any problems with this approach, even if it does require a little > more patience in seeing that (often very high quality) work make it to > the mainline. > > > That approach sounds great, and NumPy needs more active core developers! > > Cheers, > Mark > > > Regards, > > f > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion I am sorry but github pull requests do not appear to be sent to the numpy dev list. So you are not going to get many people to respond to that type of 'closed' request. Further any discussion for things that get merged into the master really should be on the list especially as many people do extensive testing. Bug fixes probably do not need further notification but feature additions or API/ABI changes should have wider notification. So an email to the list would be greatly appreciated so that interested people can track the request and any discussions there. Then, depending on the nature of the request, a second email that notifies that the request will be merged. I can understand Windows failures because not that many people build under Windows but build failures under Linux are rather hard to understand. If you do not test Python 2.4, 2.5, 2.6, 2.7, 3.1 and 3.2 with the supported operating systems (mainly 32-bit and 64-bit Linux, Mac and Windows) then you must let those people who can and give them time to build and test it. That is really true when you acknowledged that you broke one of the 'one of the datetime API functions'. Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbelson at princeton.edu Wed Jun 8 14:20:16 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Wed, 8 Jun 2011 14:20:16 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication Message-ID: Hello, I'm parallelizing some code I've written using the built in multiprocessing module. In my application, I need to multiply many large arrays together and sum the resulting product arrays (inner products). I noticed that when I parallelized this with myPool.map(...) with 8 processes (on an 8-core machine), the code was actually about an order of magnitude slower. I realized that this only happens when I use the numpy array multiplication. I defined my own array multiplication and summation. As expected, this is much slower than the numpy versions, but when I parallelized it I saw the roughly 8x speedup I expected. So something about numpy array multiplication prevented me from speeding it up with multiprocessing. Is there a way to speed up numpy array multiplication with multiprocessing? I noticed that numpy's SVD uses all of the available cores on its own, does numpy array multiplication do something similar already? I'm copying the code which can reproduce and summarize these results. 
Simply run: "python shared_mem.py" with myutil.py in the same directory. Thanks, Brandt -- myutil.py -- # Utility functions import numpy as N def numpy_inner_product(snap1,snap2): """ A default inner product for n-dimensional numpy arrays """ return N.sum(snap1*snap2.conj()) def my_inner_product(a,b): ip = 0 for r in range(a.shape[0]): for c in range(a.shape[1]): ip += a[r,c]*b[r,c] return ip def my_random(args): return N.random.random(args) def eval_func_tuple(f_args): """Takes a tuple of a function and args, evaluates and returns result""" return f_args[0](*f_args[1:]) -- shared_mem.py -- import myutil import numpy as N import copy import itertools import multiprocessing import time as T processes = multiprocessing.cpu_count() pool = multiprocessing.Pool(processes=processes) def IPs(): """Find inner products of all arrays with themselves""" arraySize = (300,200) numArrays = 50 arrayList = pool.map(myutil.eval_func_tuple, itertools.izip( itertools.repeat(myutil.my_random,numArrays),itertools.repeat(arraySize,numArrays))) IPs = N.zeros(numArrays) startTime = T.time() for arrayIndex,arrayValue in enumerate(arrayList): IPs[arrayIndex] = myutil.numpy_inner_product(arrayValue, arrayValue) endTime = T.time() - startTime print 'No shared memory, numpy array multiplication took',endTime,'seconds' startTime = T.time() innerProductList = pool.map(myutil.eval_func_tuple, itertools.izip(itertools.repeat(myutil.numpy_inner_product), arrayList, arrayList)) IPs = N.array(innerProductList) endTime = T.time() - startTime print 'Shared memory, numpy array multiplication took',endTime,'seconds' startTime = T.time() for arrayIndex,arrayValue in enumerate(arrayList): IPs[arrayIndex] = myutil.my_inner_product(arrayValue, arrayValue) endTime = T.time() - startTime print 'No shared memory, my array multiplication took',endTime,'seconds' startTime = T.time() innerProductList = pool.map(myutil.eval_func_tuple, itertools.izip(itertools.repeat(myutil.my_inner_product), arrayList, arrayList)) IPs = N.array(innerProductList) endTime = T.time() - startTime print 'Shared memory, my array multiplication took',endTime,'seconds' if __name__ == '__main__': print 'Using',processes,'processes' IPs() -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbelson at princeton.edu Wed Jun 8 14:24:57 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Wed, 8 Jun 2011 14:24:57 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication Message-ID: I'm attaching my files this time since it looks like the formatting was messed up when I copied. Brandt On Wed, Jun 8, 2011 at 2:20 PM, wrote: > Send NumPy-Discussion mailing list submissions to > numpy-discussion at scipy.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mail.scipy.org/mailman/listinfo/numpy-discussion > or, via email, send a message with subject or body 'help' to > numpy-discussion-request at scipy.org > > You can reach the person managing the list at > numpy-discussion-owner at scipy.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of NumPy-Discussion digest..." > > > Today's Topics: > > 1. Re: merging datetime progress (Bruce Southey) > 2. 
Using multiprocessing (shared memory) with numpy array > multiplication (Brandt Belson)
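On the shared-memory part of the question above: Pool.map pickles every
array it sends to (and receives back from) the workers, so none of the runs
labelled "shared memory" actually share anything, and for an operation as
cheap as an elementwise multiply-and-sum that communication cost easily
swamps the computation. A sketch of one way to get genuinely copy-free
shared arrays on a fork-based platform (Linux/Mac), using made-up names and
the same 50 arrays of shape (300, 200) as above:

import multiprocessing as mp
import numpy as np

NUM_ARRAYS, NROWS, NCOLS = 50, 300, 200

# one flat shared-memory block; forked worker processes inherit it without copying
shared_buf = mp.RawArray('d', NUM_ARRAYS * NROWS * NCOLS)
arrays = np.frombuffer(shared_buf, dtype=np.float64).reshape(NUM_ARRAYS, NROWS, NCOLS)
arrays[:] = np.random.random(arrays.shape)

def inner_product(i):
    # re-view the inherited buffer inside the worker; nothing is pickled
    a = np.frombuffer(shared_buf, dtype=np.float64).reshape(NUM_ARRAYS, NROWS, NCOLS)[i]
    return np.sum(a * a)

if __name__ == '__main__':
    pool = mp.Pool()
    print(pool.map(inner_product, range(NUM_ARRAYS))[:5])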
Name: myutil.py Type: application/octet-stream Size: 534 bytes Desc: not available URL: From mwwiebe at gmail.com Wed Jun 8 14:55:32 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 13:55:32 -0500 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: <4DEFB993.9050306@gmail.com> References: <4DEFB993.9050306@gmail.com> Message-ID: On Wed, Jun 8, 2011 at 1:04 PM, Bruce Southey wrote: > ** > > > > I am sorry but github pull requests do not appear to be sent to the numpy > dev list. So you are not going to get many people to respond to that type of > 'closed' request. Further any discussion for things that get merged into the > master really should be on the list especially as many people do extensive > testing. > This is a good point, would it be possible to add numpy-discussion as a collaborator in the NumPy github repository, so it would get those emails too? The number of pull requests is relatively small, so this wouldn't add a great deal more traffic. > Bug fixes probably do not need further notification but feature additions > or API/ABI changes should have wider notification. So an email to the list > would be greatly appreciated so that interested people can track the request > and any discussions there. Then, depending on the nature of the request, a > second email that notifies that the request will be merged. > I was attempting to provide reasonable notification, but I see the pull request vs numpy-discussion email is an issue. I can understand Windows failures because not that many people build under > Windows but build failures under Linux are rather hard to understand. If you > do not test Python 2.4, 2.5, 2.6, 2.7, 3.1 and 3.2 with the supported > operating systems (mainly 32-bit and 64-bit Linux, Mac and Windows) then you > must let those people who can and give them time to build and test it. That > is really true when you acknowledged that you broke one of the 'one of the > datetime API functions'. > Requiring that amount of overhead before committing a change into master, which is the unstable development branch, sounds very unreasonable to me. I really wish the buildbot system or something equivalent could provide continuous integration testing with minimal developer effort. I also think the default monolithic build mode in setup.py should be removed, and the multi-file approach should just be used, to reduce the number of possible build configurations. I don't want to get sucked into the distutils vortex, however, someone else needs to volunteer to make a change like this. I would prefer for it to be relatively easy for a change to go into master, with quick responses when things break. That's what I'm trying to provide for the problem reports being sent to the list. There's the 1.6.x branch available for people who want stability with the latest bug-fixes. The datetime API already received some discussion near the start of the long datetime thread, with the general response being to mark the datetime functionality as experimental with big warning signs. My interpretation of this is that it is about releases, not the unstable development branch, and view it the same way as when NumPy master was ABI-incompatible until it was decided to fix that approaching the 1.6 release. In this case, the ABI is still compatible with 1.6, and I suspect the datetime API function in question doesn't work as would be expected in 1.6 either. 
My experience with the new-iterator branch was that while I got some feedback while it was in a branch, only merging into master produced the feedback necessary to get it properly stable. Anyone that's pulling from master is participating in the development process, which can occasionally involve broken builds, crashes, and bugs, and feedback from people doing that is invaluable. I really hope nobody is automatically updating from master in a production system anywhere, if they are they should stop... -Mark > > Bruce > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Jun 8 15:55:05 2011 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 8 Jun 2011 14:55:05 -0500 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: <4DEFB993.9050306@gmail.com> Message-ID: On Wed, Jun 8, 2011 at 13:55, Mark Wiebe wrote: > Requiring that amount of overhead before committing a change into master, > which is the unstable development branch, sounds very unreasonable to me. Ah, that's the source of the misunderstanding, then. master is not the unstable development branch. According to the workflow that most numpy developers have agreed upon[1], master is an integration branch. The ideal situation is that you do work on feature branches and merge them in when they are complete, agreed upon, and tested. Ideally, master should always build cleanly and always pass the test suite. Developers need a clean, working master to branch from too, not just production users. [1] http://docs.scipy.org/doc/numpy/dev/gitwash/index.html -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From rayvroberts at gmail.com Wed Jun 8 16:07:56 2011 From: rayvroberts at gmail.com (Raymond Roberts) Date: Wed, 8 Jun 2011 15:07:56 -0500 Subject: [Numpy-discussion] import numpy fails on dev build Message-ID: numpy.__version__ == 2.0.0.dev-76ca55f python version 2.7.1 Gentoo Linux 2.6.38-r6 installed inside of VMWare fusion on Mac OS X 10.6.7 gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5 Build is successful. Import of numpy dev build results in the following error message ImportError: ~/numpy/lib/python/numpy/core/multiarray.so: undefined symbol: _PyUnicodeUCS4_IsWhitespace Captured the output of build and nowhere is there an error or warning related to _PyUnicodeUCS4_IsWhitespace. -Ray -------------- next part -------------- An HTML attachment was scrubbed... URL: From ischnell at enthought.com Wed Jun 8 16:12:02 2011 From: ischnell at enthought.com (Ilan Schnell) Date: Wed, 8 Jun 2011 15:12:02 -0500 Subject: [Numpy-discussion] import numpy fails on dev build In-Reply-To: References: Message-ID: This is due to a mismatch of the UCS mode of Python and the C extension you build. Python has a configure option to let you choose whether unicode symbols are represented as 2 of 4 bytes, i.e. UCS2 or UCS4. - Ilan On Wed, Jun 8, 2011 at 3:07 PM, Raymond Roberts wrote: > numpy.__version__ ==?2.0.0.dev-76ca55f > python version 2.7.1 > Gentoo Linux 2.6.38-r6 installed inside of VMWare fusion on Mac OS X 10.6.7 > gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5 > Build is successful. 
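A quick way to check which flavour the running interpreter was built with,
relevant to the UCS2/UCS4 mismatch Ilan describes above:

import sys
print(sys.maxunicode)   # 65535 -> narrow (UCS2) build, 1114111 -> wide (UCS4) build

Rebuilding numpy with the same interpreter that will import it, so the two
agree on the UCS mode, makes the undefined-symbol error go away.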
Import of numpy dev build results in the following > error message > ImportError: ~/numpy/lib/python/numpy/core/multiarray.so: undefined symbol: > _PyUnicodeUCS4_IsWhitespace > Captured the output of build and nowhere is there an error or warning > related to?_PyUnicodeUCS4_IsWhitespace. > > > -Ray > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From bsouthey at gmail.com Wed Jun 8 16:23:39 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 8 Jun 2011 15:23:39 -0500 Subject: [Numpy-discussion] merging datetime progress In-Reply-To: References: <4DEFB993.9050306@gmail.com> Message-ID: On Wed, Jun 8, 2011 at 1:55 PM, Mark Wiebe wrote: > On Wed, Jun 8, 2011 at 1:04 PM, Bruce Southey wrote: >> >> >> >> I am sorry but github pull requests do not appear to be sent to the numpy >> dev list. So you are not going to get many people to respond to that type of >> 'closed' request. Further any discussion for things that get merged into the >> master really should be on the list especially as many people do extensive >> testing. > > This is a good point, would it be possible to add numpy-discussion as a > collaborator in the NumPy github repository, so it would get those emails > too? The number of pull requests is relatively small, so this wouldn't add a > great deal more traffic. > >> >> Bug fixes probably do not need further notification but feature additions >> or API/ABI changes should have wider notification. So an email to the list >> would be greatly appreciated so that interested people can track the request >> and any discussions there. Then, depending on the nature of the request, a >> second email that notifies that the request will be merged. > > I was attempting to provide reasonable notification, but I see the pull > request vs numpy-discussion email is an issue. >> >> I can understand Windows failures because not that many people build under >> Windows but build failures under Linux are rather hard to understand. If you >> do not test Python 2.4, 2.5, 2.6, 2.7, 3.1 and 3.2 with the supported >> operating systems (mainly 32-bit and 64-bit Linux, Mac and Windows) then you >> must let those people who can and give them time to build and test it. That >> is really true when you acknowledged that you broke one of the 'one of the >> datetime API functions'. > > Requiring that amount of overhead before committing a change into master, > which is the unstable development branch, sounds very unreasonable to me. I > really wish the buildbot system or something equivalent could provide > continuous integration testing with minimal developer effort. I also think > the default monolithic build mode in setup.py should be removed, and the > multi-file approach should just be used, to reduce the number of possible > build configurations. I don't want to get sucked into the distutils vortex, > however, someone else needs to volunteer to make a change like this. > I would prefer for it to be relatively easy for a change to go into master, > with quick responses when things break. That's what I'm trying to provide > for the problem reports being sent to the list. There's the 1.6.x branch > available for people who want stability with the latest bug-fixes. > The datetime API already received some discussion near the start of the long > datetime thread, with the general response being to mark the datetime > functionality as experimental with big warning signs. 
My interpretation of > this is that it is about releases, not the unstable development branch, and > view it the same way as when NumPy master was ABI-incompatible until it was > decided to fix that approaching the 1.6 release. In this case, the ABI is > still compatible with 1.6, and I suspect the datetime API function in > question doesn't work as would be expected in 1.6 either. > My experience with the new-iterator branch was that while I got some > feedback while it was in a branch, only merging into master produced the > feedback necessary to get it properly stable. Anyone that's pulling from > master is participating in the development process, which can occasionally > involve broken builds, crashes, and bugs, and feedback from people doing > that is invaluable. I really hope nobody is automatically updating from > master in a production system anywhere, if they are they should stop... > -Mark > >> >> Bruce >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > I do understand and really do appreciate the effort that your putting in. Thanks Bruce From mwwiebe at gmail.com Wed Jun 8 17:05:29 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 16:05:29 -0500 Subject: [Numpy-discussion] Default unit for datetime/timedelta Message-ID: The NEP and current implementation of the datetime specifies microseconds as the default unit when constructing and converting to datetimes and timedeltas. Having the np.arange function work in a general, intuitive way for a wide variety of datetime/timedelta/integer inputs is turning out to be very complicated, and has made me realized that adding a "generic time unit" which gets assigned to the data type when unspecified would simplify this code and also make it more consistent with the behavior of +/- operations. Here are some current behaviors that are inconsistent with the microsecond default, but consistent with the "generic time unit" idea: >>> np.timedelta64(10, 's') + 10 numpy.timedelta64(20,'s') >>> np.datetime64('2011-03-12') + 3 numpy.datetime64('2011-03-15','D') I'd like to make 'M8' and 'm8' be datetime data types with generic time units instead of microseconds as they are currently. This would also allow the possibility of extending the behavior of detecting the unit from the input string as: >>> np.datetime64('2011-03-12T13') numpy.datetime64('2011-03-12T13-0600','h') to also work with arrays, which currently work like this: >>> np.array(['2011-03-12T13', '2012'], dtype='M8') array(['2011-03-12T13:00:00.000000-0600', '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') This is similar to some cases of the string data type, where the size of the string is detected based on the input. -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From akshar.bhosale at gmail.com Wed Jun 8 17:21:04 2011 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Thu, 9 Jun 2011 02:51:04 +0530 Subject: [Numpy-discussion] numpy : error with cc compiler Message-ID: Hi, we have RHEL 5.2 x86_64, python 2.6, intel compilers 11.0.69, intel mkl 10.1. We are trying to configure numpy 1.6.0 and we are getting below errors. 
our site.cfg is : ------------------------- [mkl] #mkl_libs = mkl_blacs_intelmpi_ilp64, mkl_blacs_intelmpi_lp64, mkl_core, mkl_def, mkl_gf_ilp64, mkl_gf_lp64, mkl_gnu_thread, mkl_intel_ilp64, mkl_intel_lp64, mkl_intel_sp2dp, mkl_intel_thread, mkl_lapack, mkl_mc3, mkl_mc, mkl_p4n, mkl_scalapack_ilp64, mkl_scalapack_lp64, mkl_sequential, mkl, iomp, guide, mkl_vml_def, mkl_vml_mc2, mkl_vml_mc3, mkl_vml_mc, mkl_vml_p4n,mkl_def, mkl_intel_lp64, mkl_intel_thread, mkl_core, mkl_mc mkl_libs = mkl_def, mkl_intel_lp64, mkl_intel_thread, mkl_core, mkl_mc lapack_libs = mkl_lapack95_lp64 #lapack_libs =mkl_lapack,mkl_scalapack_ilp64,mkl_scalapack_lp64,mkl_lapack95_lp64 library_dirs = /opt/intel/Compiler/11.0/069/mkl/lib/em64t:/opt/intel/Compiler/11.0/069/lib/intel64/ include_dirs = /opt/intel/Compiler/11.0/069/mkl/include:/opt/intel/Compiler/11.0/069/include/ ------------------------- We are not able to understand what all flags to include with icc and ifort in intelccompiler.py and intel.py. Also what ll libraries to link for above configuration is not given. Can somebody guide for with proper guidelines so that installation goes smooth. ERRORS : DeprecationWarning) C compiler: cc compile options: '-Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/home/external/unipune/gadre/Python-2.6/Include -I/home/external/unipune/gadre/Python-2.6 -c ' cc: _configtest.c cc _configtest.o -o _configtest _configtest failure. removing: _configtest.c _configtest.o _configtest building data_files sources build_src: building npy-pkg config files customize IntelCCompiler customize IntelCCompiler using build_clib building 'npymath' library compiling C sources C compiler: cc creating build/temp.linux-x86_64-2.6 creating build/temp.linux-x86_64-2.6/build creating build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6 creating build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy creating build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core creating build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src creating build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/npymath creating build/temp.linux-x86_64-2.6/numpy creating build/temp.linux-x86_64-2.6/numpy/core creating build/temp.linux-x86_64-2.6/numpy/core/src creating build/temp.linux-x86_64-2.6/numpy/core/src/npymath compile options: '-Inumpy/core/include -Ibuild/src.linux-x86_64-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/include -I/home/external/un ipune/gadre/Python-2.6/Include -I/home/external/unipune/gadre/Python-2.6 -Ibuild/src.linux-x86_64-2.6/numpy/core/src/multiarray -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -c' cc: numpy/core/src/npymath/halffloat.c cc: build/src.linux-x86_64-2.6/numpy/core/src/npymath/npy_math_complex.c cc: build/src.linux-x86_64-2.6/numpy/core/src/npymath/npy_math.c cc: build/src.linux-x86_64-2.6/numpy/core/src/npymath/ieee754.c ar: adding 4 object files to build/temp.linux-x86_64-2.6/libnpymath.a running build_ext customize IntelCCompiler customize IntelCCompiler using build_ext building 'numpy.core._sort' extension compiling C sources C compiler: cc compile options: '-Inumpy/core/include -Ibuild/src.linux-x86_64-2.6/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray 
-Inumpy/core/src/umath -Inumpy/core/include -I/home/external/un ipune/gadre/Python-2.6/Include -I/home/external/unipune/gadre/Python-2.6 -Ibuild/src.linux-x86_64-2.6/numpy/core/src/multiarray -Ibuild/src.linux-x86_64-2.6/numpy/core/src/umath -c' cc: build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.c creating build/lib.linux-x86_64-2.6 creating build/lib.linux-x86_64-2.6/numpy creating build/lib.linux-x86_64-2.6/numpy/core cc -shared build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o -Lbuild/temp.linux-x86_64-2.6 -lnpymath -lm -o build/lib.linux-x86_64-2.6/numpy/core/_sort.so /usr/bin/ld: build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o: could not read symbols: Bad value collect2: ld returned 1 exit status /usr/bin/ld: build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o: could not read symbols: Bad value collect2: ld returned 1 exit status error: Command "cc -shared build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o -Lbuild/temp.linux-x86_64-2.6 -lnpymath -lm -o build/lib.linux-x86_64-2.6/numpy/core/_sort.so" failed with exit status 1 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 8 17:34:49 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 16:34:49 -0500 Subject: [Numpy-discussion] Current datetime pull requests ready for review Message-ID: I've been trying to separate out the features I've been doing into separate branches so I can split up into multiple pull, but the development I'm doing works pretty poorly with this workflow. Nearly every step of development builds on or uses changes from previous features. Since github doesn't do dependent pull requests, I'm going to go back to developing in a single branch to merge into master periodically. There are currently two pull requests ready to go, which could use some review: "Datetime unit from string" https://github.com/numpy/numpy/pull/85 This makes conversion from a string to a datetime scalar automatically detect the unit if it's not specified. To support the same idea for arrays will require adding generic time units as I described in another email. "Unify datetime/timedelta type promotion" https://github.com/numpy/numpy/pull/86 This simplifies the rules of how datetimes and timedeltas combine to always take the gcd/more precise unit instead of having the various special cases as the NEP described. The motivation for this change was based on how restricting those rules felt after spending some time experimenting with datetime calculations. Please review! Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at googlemail.com Wed Jun 8 17:38:02 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Wed, 8 Jun 2011 23:38:02 +0200 Subject: [Numpy-discussion] numpy : error with cc compiler In-Reply-To: References: Message-ID: On Wed, Jun 8, 2011 at 11:21 PM, akshar bhosale wrote: > Hi, > > we have RHEL 5.2 x86_64, python 2.6, intel compilers 11.0.69, intel mkl > 10.1. We are trying to configure numpy 1.6.0 and we are getting below > errors. > our site.cfg is : > ------------------------- > [mkl] > #mkl_libs = mkl_blacs_intelmpi_ilp64, mkl_blacs_intelmpi_lp64, mkl_core, > mkl_def, mkl_gf_ilp64, mkl_gf_lp64, mkl_gnu_thread, mkl_intel_ilp64, > mkl_intel_lp64, mkl_intel_sp2dp, mkl_intel_thread, mkl_lapack, mkl_mc3, > mkl_mc, mkl_p4n, mkl_scalapack_ilp64, mkl_scalapack_lp64, mkl_sequential, > mkl, iomp, guide, mkl_vml_def, mkl_vml_mc2, mkl_vml_mc3, mkl_vml_mc, > mkl_vml_p4n,mkl_def, mkl_intel_lp64, mkl_intel_thread, mkl_core, mkl_mc > > > mkl_libs = mkl_def, mkl_intel_lp64, mkl_intel_thread, mkl_core, mkl_mc > lapack_libs = mkl_lapack95_lp64 > > #lapack_libs > =mkl_lapack,mkl_scalapack_ilp64,mkl_scalapack_lp64,mkl_lapack95_lp64 > library_dirs = > /opt/intel/Compiler/11.0/069/mkl/lib/em64t:/opt/intel/Compiler/11.0/069/lib/intel64/ > include_dirs = > /opt/intel/Compiler/11.0/069/mkl/include:/opt/intel/Compiler/11.0/069/include/ > ------------------------- > We are not able to understand what all flags to include with icc and ifort > in intelccompiler.py and intel.py. Also what ll libraries to link for above > configuration is not given. Can somebody guide for with proper guidelines so > that installation goes smooth. > > ERRORS : > > DeprecationWarning) > C compiler: cc > > compile options: '-Inumpy/core/src/private -Inumpy/core/src -Inumpy/core > -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath > -Inumpy/core/include -I/home/external/unipune/gadre/Python-2.6/Include > -I/home/external/unipune/gadre/Python-2.6 -c > ' > cc: _configtest.c > cc _configtest.o -o _configtest > _configtest > failure. 
> [... remainder of the quoted build log trimmed; it repeats verbatim the output shown in the original message above ...]
> error: Command "cc -shared
build/temp.linux-x86_64-2.6/build/src.linux-x86_64-2.6/numpy/core/src/_sortmodule.o > -Lbuild/temp.linux-x86_64-2.6 -lnpymath -lm -o > build/lib.linux-x86_64-2.6/numpy/core/_sort.so" failed with exit status 1 > > The compiler is giving you some friendly advice above: recompile with -fPIC. Did you try that? Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Wed Jun 8 18:09:12 2011 From: ben.root at ou.edu (Benjamin Root) Date: Wed, 8 Jun 2011 17:09:12 -0500 Subject: [Numpy-discussion] Current datetime pull requests ready for review In-Reply-To: References: Message-ID: On Wednesday, June 8, 2011, Mark Wiebe wrote: > I've been trying to separate out the features I've been doing into separate branches so I can split up into multiple pull, but the development I'm doing works pretty poorly with this workflow. Nearly every step of development builds on or uses changes from previous features. Since github doesn't do dependent pull requests, I'm going to go back to developing in a single branch to merge into master periodically. > > There are currently two pull requests ready to go, which could use some review: > "Datetime unit from string"?https://github.com/numpy/numpy/pull/85 > > This makes conversion from a string to a datetime scalar automatically detect the unit if it's not specified. To support the same idea for arrays will require adding generic time units as I described in another email. > > "Unify datetime/timedelta type promotion"?https://github.com/numpy/numpy/pull/86 > > This simplifies the rules of how datetimes and timedeltas combine to always take the gcd/more precise unit instead of having the various special cases as the NEP described. The motivation for this change was based on how restricting those rules felt after spending some time experimenting with datetime calculations. > > Please review! > Cheers,Mark > Just a thought. I think one way to do something like dependent pull requests is to develop each feature in it's own branch, and then have a branch that would merge some of those feature branches. These branches would represent the logical dependencies needed, and could be arbitrarially nested. It wouldn't quite be exactly what you are wanting, but maybe something similar could be done with pull requests, since one can continue developing in a branch until the pull is done. Just thinking out loud... Ben Root From pgmdevlist at gmail.com Wed Jun 8 18:48:38 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 9 Jun 2011 00:48:38 +0200 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: References: Message-ID: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> On Jun 8, 2011, at 11:05 PM, Mark Wiebe wrote: > The NEP and current implementation of the datetime specifies microseconds as the default unit when constructing and converting to datetimes and timedeltas. AFAIU, the default is [us] when otherwise unspecified. > Here are some current behaviors that are inconsistent with the microsecond default, but consistent with the "generic time unit" idea: > > >>> np.timedelta64(10, 's') + 10 > numpy.timedelta64(20,'s') Here, the unit is defined: 's' > >>> np.datetime64('2011-03-12') + 3 > numpy.datetime64('2011-03-15','D') OK, here it is not. But the result makes sense... Up to a certain point. If you try to guess the unit from a date given as a string, what happens in case of ambiguities ? Or do you restrict an input string to be strictly ISO8601 to remove those ? 
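A minimal sketch of the unit detection that a strict ISO 8601 restriction makes possible -- the precision of the string itself pins down the unit, so there is little left to guess. The helper below is purely illustrative; its name and behavior are assumptions, not the parser in Mark's branch:

def iso8601_unit(s):
    # Map the precision of a strict ISO 8601 string to a datetime64 unit code.
    date_part, _, time_part = s.partition('T')
    if not time_part:
        return {1: 'Y', 2: 'M', 3: 'D'}[len(date_part.split('-'))]
    # strip a trailing 'Z' or numeric timezone offset, if any
    time_part = time_part.split('Z')[0].split('+')[0].split('-')[0]
    if '.' in time_part:
        return 'ms' if len(time_part.split('.')[1]) <= 3 else 'us'
    return {1: 'h', 2: 'm', 3: 's'}[len(time_part.split(':'))]

iso8601_unit('2011')           # -> 'Y'
iso8601_unit('2011-03')        # -> 'M'
iso8601_unit('2011-03-12T13')  # -> 'h', matching the example earlier in the thread
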
> I'd like to make 'M8' and 'm8' be datetime data types with generic time units instead of microseconds as they are currently. This would also allow the possibility of extending the behavior of detecting the unit from the input string as: > > >>> np.datetime64('2011-03-12T13') > numpy.datetime64('2011-03-12T13-0600','h') > > to also work with arrays, which currently work like this: > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > array(['2011-03-12T13:00:00.000000-0600', '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') Why is the second one not '2012-01-01T00:00:00-0600' ? Otherwise, I'm all for it. From mwwiebe at gmail.com Wed Jun 8 19:10:26 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 18:10:26 -0500 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> References: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> Message-ID: On Wed, Jun 8, 2011 at 5:48 PM, Pierre GM wrote: > > On Jun 8, 2011, at 11:05 PM, Mark Wiebe wrote: > > > The NEP and current implementation of the datetime specifies microseconds > as the default unit when constructing and converting to datetimes and > timedeltas. > > AFAIU, the default is [us] when otherwise unspecified. > That's correct. > Here are some current behaviors that are inconsistent with the microsecond > default, but consistent with the "generic time unit" idea: > > > > >>> np.timedelta64(10, 's') + 10 > > numpy.timedelta64(20,'s') > > Here, the unit is defined: 's' > For the first operand, the inconsistency is with the second. Here's the reasoning I didn't spell out: We're adding a timedelta + int, so lets convert 10 into a timedelta. No units specified, so it's 10 microseconds, so we add 10 seconds and 10 microseconds, not 10 seconds and 10 seconds. This intuitive behavior which was specified in the NEP for + follows naturally from having generic units, but not from having a default of microseconds. > >>> np.datetime64('2011-03-12') + 3 > > numpy.datetime64('2011-03-15','D') > > OK, here it is not. But the result makes sense... Up to a certain point. If > you try to guess the unit from a date given as a string, what happens in > case of ambiguities ? Or do you restrict an input string to be strictly > ISO8601 to remove those ? > Yeah, I'm restricting the string to be (almost) strictly ISO8601. For supporting other formats, I think creating a 'fancy_date_parser' function or something like that would be better than having all those date string format ambiguities in the core type. > I'd like to make 'M8' and 'm8' be datetime data types with generic time > units instead of microseconds as they are currently. This would also allow > the possibility of extending the behavior of detecting the unit from the > input string as: > > > > >>> np.datetime64('2011-03-12T13') > > numpy.datetime64('2011-03-12T13-0600','h') > > > > to also work with arrays, which currently work like this: > > > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > > array(['2011-03-12T13:00:00.000000-0600', > '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') > > Why is the second one not '2012-01-01T00:00:00-0600' ? > This is because dates are stored at midnight UTC, and when converted to local time for the default time-based printing, that changes slightly. ISO8601 specifies to interpret an input in local time if no "Z" or timezone offset is given, so that's why the first one matches. 
I haven't been able to think of a way around it other than putting warnings in the documentation, and have made 'today' and 'now' throw errors if you try to use them as times or dates respectively. > > Otherwise, I'm all for it. > Cool. -Mark > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Wed Jun 8 19:31:09 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 9 Jun 2011 01:31:09 +0200 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: References: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> Message-ID: On Jun 9, 2011, at 1:10 AM, Mark Wiebe wrote: > > > >>> np.timedelta64(10, 's') + 10 > > numpy.timedelta64(20,'s') > > Here, the unit is defined: 's' > > For the first operand, the inconsistency is with the second. Here's the reasoning I didn't spell out: > We're adding a timedelta + int, so lets convert 10 into a timedelta. No units specified, so it's > 10 microseconds, so we add 10 seconds and 10 microseconds, not 10 seconds and 10 seconds. Ah OK. I think that your approach of taking the defined unit (here, s) as unit of the undefined term (here, 10) is by far the best. > >OK, here it is not. But the result makes sense... Up to a certain point. If you try to guess the unit from a date given as a >string, what happens in case of ambiguities ? Or do you restrict an input string to be strictly ISO8601 to remove those ? > > Yeah, I'm restricting the string to be (almost) strictly ISO8601. For supporting other formats, I think creating a 'fancy_date_parser' function or something like that would be better than having all those date string format ambiguities in the core type. Quite OK. But this 'fancy_date_parser' will likely crash at some point if the unit cannot be guessed properly. But you're right, that's not the issue here. > > > I'd like to make 'M8' and 'm8' be datetime data types with generic time units instead of microseconds as they are currently. This would also allow the possibility of extending the behavior of detecting the unit from the input string as: > > > > >>> np.datetime64('2011-03-12T13') > > numpy.datetime64('2011-03-12T13-0600','h') > > > > to also work with arrays, which currently work like this: > > > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > > array(['2011-03-12T13:00:00.000000-0600', '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') > > Why is the second one not '2012-01-01T00:00:00-0600' ? > > This is because dates are stored at midnight UTC, and when converted to local time for the default time-based printing, that changes slightly. > ISO8601 specifies to interpret an input in local time if no "Z" or timezone offset is given, so that's why the first one matches. I haven't been able to think of a way around it other than putting warnings in the documentation, and have made 'today' and 'now' throw errors if you try to use them as times or dates respectively. I see the logic, but I don't like it at all. I would expect the date to be stored in the local time zone by default (that is, if no other time zone info is available). 
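For reference, a standard-library-only sketch of the round trip being described: a calendar date pinned to midnight UTC, rendered in a local zone six hours behind UTC, comes out as the previous evening. The -6 offset is only an example, not an assumption about any particular machine:

from datetime import datetime, timedelta

midnight_utc = datetime(2012, 1, 1)      # how the date '2012' is stored
local_offset = timedelta(hours=-6)       # e.g. US Central Standard Time
print(midnight_utc + local_offset)       # 2011-12-31 18:00:00
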
From charlesr.harris at gmail.com Wed Jun 8 19:53:27 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 8 Jun 2011 17:53:27 -0600 Subject: [Numpy-discussion] Current datetime pull requests ready for review In-Reply-To: References: Message-ID: On Wed, Jun 8, 2011 at 4:09 PM, Benjamin Root wrote: > On Wednesday, June 8, 2011, Mark Wiebe wrote: > > I've been trying to separate out the features I've been doing into > separate branches so I can split up into multiple pull, but the development > I'm doing works pretty poorly with this workflow. Nearly every step of > development builds on or uses changes from previous features. Since github > doesn't do dependent pull requests, I'm going to go back to developing in a > single branch to merge into master periodically. > > > > There are currently two pull requests ready to go, which could use some > review: > > "Datetime unit from string" https://github.com/numpy/numpy/pull/85 > > > > This makes conversion from a string to a datetime scalar automatically > detect the unit if it's not specified. To support the same idea for arrays > will require adding generic time units as I described in another email. > > > > "Unify datetime/timedelta type promotion" > https://github.com/numpy/numpy/pull/86 > > > > This simplifies the rules of how datetimes and timedeltas combine to > always take the gcd/more precise unit instead of having the various special > cases as the NEP described. The motivation for this change was based on how > restricting those rules felt after spending some time experimenting with > datetime calculations. > > > > Please review! > > Cheers,Mark > > > > Just a thought. I think one way to do something like dependent pull > requests is to develop each feature in it's own branch, and then have > a branch that would merge some of those feature branches. These > branches would represent the logical dependencies needed, and could be > arbitrarially nested. > > It wouldn't quite be exactly what you are wanting, but maybe > something similar could be done with pull requests, since one can > continue developing in a branch until the pull is done. > > Just thinking out loud... > > I don't think it is generally possible to develop in a straight line, breaking things up into a nice sequence can probably only be done after the fact and it takes work. There is a trade off here as to how the scarce resource of developer time gets allocated between the tasks. Mark is a very productive guy so the stuff just piles up and there aren't that many doing review. OTOH, doing review is much easier if the chunks are smaller. Probably a bit more time should to go into producing and polishing the chunks but just how to make that division of labor is probably something we will only arrive at with some experience. As to the current problem, we could probably spend some time thinking about how we use types in writing numpy code. There is some tension between the c types and the numpy types that I notice even when looking at the sorting code where the functions are currently associated with c types rather than numpy sized types. And let's not forget the useful experience one gains by breaking the trunk now and then. It makes one more cognizant of what can go wrong and a bit more careful in the future. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Wed Jun 8 20:22:07 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 19:22:07 -0500 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: References: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> Message-ID: On Wed, Jun 8, 2011 at 6:31 PM, Pierre GM wrote: > > On Jun 9, 2011, at 1:10 AM, Mark Wiebe wrote: > > > > > >>> np.timedelta64(10, 's') + 10 > > > numpy.timedelta64(20,'s') > > > > Here, the unit is defined: 's' > > > > For the first operand, the inconsistency is with the second. Here's the > reasoning I didn't spell out: > > We're adding a timedelta + int, so lets convert 10 into a timedelta. No > units specified, so it's > > 10 microseconds, so we add 10 seconds and 10 microseconds, not 10 seconds > and 10 seconds. > > Ah OK. I think that your approach of taking the defined unit (here, s) as > unit of the undefined term (here, 10) is by far the best. > > > >OK, here it is not. But the result makes sense... Up to a certain point. > If you try to guess the unit from a date given as a >string, what happens in > case of ambiguities ? Or do you restrict an input string to be strictly > ISO8601 to remove those ? > > > > Yeah, I'm restricting the string to be (almost) strictly ISO8601. For > supporting other formats, I think creating a 'fancy_date_parser' function or > something like that would be better than having all those date string format > ambiguities in the core type. > > Quite OK. But this 'fancy_date_parser' will likely crash at some point if > the unit cannot be guessed properly. But you're right, that's not the issue > here. > > > > > > > I'd like to make 'M8' and 'm8' be datetime data types with generic time > units instead of microseconds as they are currently. This would also allow > the possibility of extending the behavior of detecting the unit from the > input string as: > > > > > > >>> np.datetime64('2011-03-12T13') > > > numpy.datetime64('2011-03-12T13-0600','h') > > > > > > to also work with arrays, which currently work like this: > > > > > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > > > array(['2011-03-12T13:00:00.000000-0600', > '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') > > > > Why is the second one not '2012-01-01T00:00:00-0600' ? > > > > This is because dates are stored at midnight UTC, and when converted to > local time for the default time-based printing, that changes slightly. > > ISO8601 specifies to interpret an input in local time if no "Z" or > timezone offset is given, so that's why the first one matches. I haven't > been able to think of a way around it other than putting warnings in the > documentation, and have made 'today' and 'now' throw errors if you try to > use them as times or dates respectively. > > I see the logic, but I don't like it at all. I would expect the date to be > stored in the local time zone by default (that is, if no other time zone > info is available). > It's not satisfying to me either, but I haven't been able to think of a solution I like. The idea of converting from 'D' to 's' metadata depending on the timezone setting of your computer feels worse to me than the current approach, but if someone has an idea that's better I'm all ears. -Mark > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Wed Jun 8 20:26:36 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 19:26:36 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: On Tue, Jun 7, 2011 at 7:28 PM, Pierre GM wrote: > > > > It supports .astype(), with a truncation policy. This is motivated > partially because that's how Pythons integer division works, and partially > because if you consider a full datetime '2011-03-14T13:22:16', it's natural > to think of the year as '2011', the date as '2011-03-14', etc, which is > truncation. With regards to converting in the other direction, you can think > of a datetime as representing a single moment in time, regardless of its > unit of precision, and equate '2011' with '2011-01', etc. > > OK from high to low, less from low to high. That's where our keyword ('END' > or "START") comes into play in scikits.timeseries, so that you can decide > whether '2011' should be '2011-01' or '2011-06'... > Because datetime64 is a NumPy data type, it needs a well-defined rule for these kinds of conversions. Treating datetimes as moments in time instead of time intervals makes a very nice rule which appears to be very amenable to a variety of computations, which is why I like the approach. > > We needed the concept to convert time series, for example from monthly to > quarterly (what is the first month of the year (as in succession of 12 > months) you want to start with ?) > > > > Does that need to be in the underlying datetime for layering a good > timeseries implementation on top? > > Mmh. How would you define a quarter unit ? [3M] ? But then, what if you > want your year to start in December, say (we often use DJF/MAM/JJA/SON as a > way to decompose a year in four 'hydrological' seasons, for example) > With origin metadata I suppose defining these things as NumPy data types will work, but it still feels like it belongs at a higher level. Of course I'm not going to stop anyone from doing it... -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 8 20:47:59 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 19:47:59 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: On Wed, Jun 8, 2011 at 5:59 AM, Dave Hirschfeld wrote: > Mark Wiebe gmail.com> writes: > > > > > > It appears to me that a structured dtype with some further NumPy > extensions > > could entirely replace the 'events' metadata fairly cleanly. If the > ufuncs > > are extended to operate on structured arrays, and integers modulo n are > > added as a new dtype, a dtype like > > [('date', 'M8[D]'), ('event', 'i8[mod 100]')] could replace the current > > 'M8[D]//100'. > > Sounds like a cleaner API. > > > > >> > >> As Dave H. summarized, we used a basic keyword to do the same thing in > >> scikits.timeseries, with the addition of some subfrequencies like A-SEP > >> to represent a year starting in September, for example. It works, but > it's > >> really not elegant a solution. > >> > > > > This kind of thing definitely belongs in a layer above datetime. > > > > That's fair enough - my perspective as a timeseries user is probably a lot > higher level. 
My main aim is to point out some of the higher level uses so > that > the numpy dtype can be made compatible with them - I'd hate to have a > situation where we have multiple different datetime representations > in the different libraries and having to continually convert at the > boundaries. > > That said, I think the starting point for a series at a weekly, quarterly > or annual frequency/unit/type is something which may need to be sorted out > at the lowest level... > Yeah, that's what we're trying to figure out. I'm trying to get the functionality to work in such a way that implementing a system on top doing all the things you need can be done pretty smoothly. After the next few merge requests, it should be possible to start experimenting with it to try things out and get feedback based on that. > > > > One overall impression I have about timeseries in general is the use of > the > > term "frequency" synonymously with the time unit. To me, a frequency is a > > numerical quantity with a unit of 1/(time unit), so while it's related to > > the time unit, naming it the same is something the specific timeseries > > domain has chosen to do, I think the numpy datetime class shouldn't have > > anything called "frequency" in it, and I would like to remove the current > > usage of that terminology from the codebase. > > > > It seems that it's just a naming convention (possibly not the best) and > can be used synonymously with the "time unit"/resolution/dtype > > > I don't envision 'asfreq' being a datetime function, this is the kind > > of thing that would layer on top in a specialized timeseries library. The > > behavior of timedelta follows a more physics-like idea with regard to the > > time unit, and I don't think something more complicated belongs at the > bottom > > layer that is shared among all datetime uses. > > I think since freq <==> dtype then asfreq <==> astype. From your examples > it > seems to do the same thing - i.e. if you go to a lower resolution (freq) > representation the higher resolution information is truncated - e.g. a > monthly resolution date has no information about days/hours/minutes/seconds > etc. It's converting in the other direction: low --> high resolution > where the difference lies - numpy always converts to the start of the > interval > whereas the timeseries Date class gives you the option of the start or the > end. > The fact that it's a NumPy dtype probably is the biggest limiting factor preventing parameters like 'start' and 'end' during conversion. Having a datetime represent an instant in time neatly removes any ambiguity, so converting between days and seconds as a unit is analogous to converting between int32 and float32. > I'm thinking of a datetime as an infinitesimal moment in time, with the > > unit representing the precision that can be stored. Thus, '2011', > > '2011-01', and '2011-01-01T00:00:00.00Z' all represent the same moment in > > time, but with a different unit of precision. Computationally, this > > perspective is turning out to provide a pretty rich mechanism to do > > operations on date. > > I think this is where the main point of difference is. I use the timeseries > as a container for data which arrives in a certain interval of time. e.g. > monthly temperatures are the average temperature over the interval defined > by > the particular month. It's not the temperature at the instant of time that > the > month began, or ended or of any particular instant in that month. 
> > Thus the conversion from a monthly resolution to say a daily resolution > isn't > well defined and the right thing to do is likely to be application > specific. > > For the average temperature example you may want to choose a value in the > middle of the month so that you don't introduce a phase delay if you > interpolate between the datapoints. > > If for example you measured total rainfall in a month you might want to > choose the last day of the month to represent the total rainfall for that > month as that was the only date in the interval where the total rainfall > did in fact equal the monthly value. > > It may be as you say though that all this functionality does belong at a > higher > level... > Would it be possible to mock up a set of examples, or find an existing set of examples, which exhibit the kinds of computations that are needed? That could be used to see how easily it is to express the ideas with the numpy datetime machinery. Cheers, Mark > > Regards, > Dave > > > > > > > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 8 20:53:09 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 19:53:09 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: On Wed, Jun 8, 2011 at 4:57 AM, Wes McKinney wrote: > > > > So in summary, w.r.t. time series data and datetime, the only things I > care about from a datetime / pandas point of view: > > - Ability to easily define custom timedeltas > Can you elaborate on this a bit? I'm guessing you're not referring to the timedelta64 in NumPy, which is simply an integer with an associated unit. > - Generate datetime objects, or some equivalent, which can be used to > back pandas data structures > Do you mean mechanisms to generate sequences of datetime's? I'm fixing up arange to work reasonably with datetimes at the moment. > - (possible now??) Ability to have a set of frequency-naive dates > (possibly not in order). > This should work just as well as if you have an arbitrary set of integers specifying the locations of the sample points. Cheers, Mark This last point actually matters. Suppose you wanted to get the worst > 5-performing days in the S&P 500 index: > > In [7]: spx.index > Out[7]: > > offset: <1 BusinessDay>, tzinfo: None > [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] > length: 2963 > > # but this is OK > In [8]: spx.order()[:5] > Out[8]: > 2008-10-15 00:00:00 -0.0903497960942 > 2008-12-01 00:00:00 -0.0892952780505 > 2008-09-29 00:00:00 -0.0878970494885 > 2008-10-09 00:00:00 -0.0761670761671 > 2008-11-20 00:00:00 -0.0671229140321 > > - W > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Wed Jun 8 21:04:27 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 8 Jun 2011 20:04:27 -0500 Subject: [Numpy-discussion] Current datetime pull requests ready for review In-Reply-To: References: Message-ID: On Wed, Jun 8, 2011 at 6:53 PM, Charles R Harris wrote: > > > On Wed, Jun 8, 2011 at 4:09 PM, Benjamin Root wrote: > >> On Wednesday, June 8, 2011, Mark Wiebe wrote: >> > I've been trying to separate out the features I've been doing into >> separate branches so I can split up into multiple pull, but the development >> I'm doing works pretty poorly with this workflow. Nearly every step of >> development builds on or uses changes from previous features. Since github >> doesn't do dependent pull requests, I'm going to go back to developing in a >> single branch to merge into master periodically. >> > >> > There are currently two pull requests ready to go, which could use some >> review: >> > "Datetime unit from string" https://github.com/numpy/numpy/pull/85 >> > >> > This makes conversion from a string to a datetime scalar automatically >> detect the unit if it's not specified. To support the same idea for arrays >> will require adding generic time units as I described in another email. >> > >> > "Unify datetime/timedelta type promotion" >> https://github.com/numpy/numpy/pull/86 >> > >> > This simplifies the rules of how datetimes and timedeltas combine to >> always take the gcd/more precise unit instead of having the various special >> cases as the NEP described. The motivation for this change was based on how >> restricting those rules felt after spending some time experimenting with >> datetime calculations. >> > >> > Please review! >> > Cheers,Mark >> > >> >> Just a thought. I think one way to do something like dependent pull >> requests is to develop each feature in it's own branch, and then have >> a branch that would merge some of those feature branches. These >> branches would represent the logical dependencies needed, and could be >> arbitrarially nested. >> >> It wouldn't quite be exactly what you are wanting, but maybe >> something similar could be done with pull requests, since one can >> continue developing in a branch until the pull is done. >> >> Just thinking out loud... >> >> > I don't think it is generally possible to develop in a straight line, > breaking things up into a nice sequence can probably only be done after the > fact and it takes work. There is a trade off here as to how the scarce > resource of developer time gets allocated between the tasks. Mark is a very > productive guy so the stuff just piles up and there aren't that many doing > review. OTOH, doing review is much easier if the chunks are smaller. > Probably a bit more time should to go into producing and polishing the > chunks but just how to make that division of labor is probably something we > will only arrive at with some experience. > Yeah, one thing I'm trying to do in my recent commits is elaborate slightly more in the commit messages and make the commits slightly smaller. Maybe making topic-related tags in the commit messages might be a way to help group related commits without having to muck about with history-editing. I'm open to trying out all kinds of suggestions. As to the current problem, we could probably spend some time thinking about > how we use types in writing numpy code. 
There is some tension between the c > types and the numpy types that I notice even when looking at the sorting > code where the functions are currently associated with c types rather than > numpy sized types. > I think the code backing this aspect of the system basically belongs in a low-level "libdatatype" to encapsulate the meaning of a data type to numpy. If we want to add, say, integers mod n, integers with a NaN value, quaternions, etc, having a systematic backing which allows for minimal work to add a new type will be very helpful. And let's not forget the useful experience one gains by breaking the trunk > now and then. It makes one more cognizant of what can go wrong and a bit > more careful in the future. > Allowing such breakages, but requiring responsive fixes from those that cause them, is probably a good policy for letting in more developers. If people are paralyzed from making changes because they might break something, they're not going to be very productive at creating more functionality. Cheers, Mark > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oliphant at enthought.com Thu Jun 9 00:49:01 2011 From: oliphant at enthought.com (Travis Oliphant) Date: Wed, 8 Jun 2011 23:49:01 -0500 Subject: [Numpy-discussion] Returning the same dtype in Cython as np.argmax In-Reply-To: References: Message-ID: <915111FA-820F-4E69-A622-7BD3B2F99191@enthought.com> On Jun 7, 2011, at 3:17 PM, Keith Goodman wrote: > What is the rule to determine the dtype returned by numpy functions > that return indices such as np.argmax? The return type of indices will be np.intp. I don't know if Cython provides that type or not. If not, then it is either np.int32 or np.int64 depending on the system. Best , -Travis > > I assumed that np.argmax() returned the same dtype as np.int_. That > works on linux32/64 and win32 but on win-amd64 np.int_ is int32 and > np.argmax() returns int64. > > Someone suggested using intp. But I can't figure out how to do that in > Cython since I am unable to cimport NPY_INTP. So this doesn't work: > > from numpy cimport NPY_INTP > cdef np.ndarray[np.intp_t, ndim=1] output = PyArray_EMPTY(1, dims, NPY_INTP, 0) > > Is there another way? All I can think of is checking whether np.intp > is np.int32 or int64 when I template the code. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion --- Travis Oliphant Enthought, Inc. oliphant at enthought.com 1-512-536-1057 http://www.enthought.com From pgmdevlist at gmail.com Thu Jun 9 02:38:42 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 9 Jun 2011 08:38:42 +0200 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: References: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> Message-ID: <25E47041-5FA9-4B77-A315-004DE1FCC12E@gmail.com> On Jun 9, 2011, at 2:22 AM, Mark Wiebe wrote: > > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > > > array(['2011-03-12T13:00:00.000000-0600', '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') > > > > Why is the second one not '2012-01-01T00:00:00-0600' ? > > > > This is because dates are stored at midnight UTC, and when converted to local time for the default time-based printing, that changes slightly. 
> > ISO8601 specifies to interpret an input in local time if no "Z" or timezone offset is given, so that's why the first one matches. I haven't been able to think of a way around it other than putting warnings in the documentation, and have made 'today' and 'now' throw errors if you try to use them as times or dates respectively. > > I see the logic, but I don't like it at all. I would expect the date to be stored in the local time zone by default (that is, if no other time zone info is available). > > It's not satisfying to me either, but I haven't been able to think of a solution I like. The idea of converting from 'D' to 's' metadata depending on the timezone setting of your computer feels worse to me than the current approach, but if someone has an idea that's better I'm all ears. A simpler approach could be to drop ISO8601 recommendation and assume that any single datetime is expressed in UTC by default. Managing time zones would then be let to some specific subclass... From pgmdevlist at gmail.com Thu Jun 9 02:44:40 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 9 Jun 2011 08:44:40 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: <20D9BDDD-1CF4-4B73-BDD5-2D77A7AF4D01@gmail.com> > > The fact that it's a NumPy dtype probably is the biggest limiting factor preventing parameters like 'start' and 'end' during conversion. Having a datetime represent an instant in time neatly removes any ambiguity, so converting between days and seconds as a unit is analogous to converting between int32 and float32. Back to Dave's example, '2011' is not necessarily the same instant in time as '2011-01'. The conversion from low to high frequencies requires some reference. In scikits.timeseries terms, we could assume that 'START' is the reference. No problem with that, if we can have a way to extend that (eg, through some metadata). I tried something like that in the experimental version of scikits.timeseries I was babbling about earlier... From yoshi at rokuko.net Thu Jun 9 04:39:39 2011 From: yoshi at rokuko.net (Yoshi Rokuko) Date: Thu, 09 Jun 2011 10:39:39 +0200 Subject: [Numpy-discussion] numpy c api general array handling In-Reply-To: References: <201106080758.p587wAP7032347@lotus.yokuts.org> Message-ID: <201106090839.p598ddCV010919@lotus.yokuts.org> +-------------------------------------------- Mark Wiebe -----------+ > > I believe what you may want is PyArray_ContiguousFromAny instead of > PyArray_ContiguousFromObject. > are you sure that there is a difference? i think its just the new name for the same thing right? best regards, yoshi From dave.hirschfeld at gmail.com Thu Jun 9 05:17:38 2011 From: dave.hirschfeld at gmail.com (Dave Hirschfeld) Date: Thu, 9 Jun 2011 09:17:38 +0000 (UTC) Subject: [Numpy-discussion] Default unit for datetime/timedelta References: Message-ID: Mark Wiebe gmail.com> writes: > > Here are some current behaviors that are inconsistent with the microsecond default, but consistent with the "generic time unit" idea: > > >>> np.timedelta64(10, 's') + 10 > numpy.timedelta64(20,'s') > > That is what I would expect (and hope) would happen. IMO an integer should be cast to the dtype ([s]) of the datetime/timedelta. > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > array(['2011-03-12T13:00:00.000000-0600', '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') > I would expect the second value of the array to be midnight at the end of the year. 
I'm not sure what is actually happening in the above example. What happens if I email that code to someone in New Zealand would they get a different array?? -Dave From gmane at blindgoat.org Thu Jun 9 09:37:36 2011 From: gmane at blindgoat.org (martin smith) Date: Thu, 09 Jun 2011 09:37:36 -0400 Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows In-Reply-To: References: Message-ID: On 6/5/2011 5:46 PM, martin smith wrote: > On 6/3/2011 5:15 AM, Ralf Gommers wrote: >> >> ... >> Coming back to #608, that means there is no chance that the C version >> will land in scipy, correct? We're not going to ship two copies of >> FFTPACK. So the answer should be "rewrite in Python, if that's too slow >> use Cython". >> >> Ralf >> > > Thank you both for your prompt response and suggestions. I've modified > the setup.py file as Ralf suggested and concocted a version of the code > (summarized below) that doesn't try to link to existing modules. > > The formal problem with a python rewrite is that there is some > functionality in the package (the adaptive spectrum estimate) that > requires solving an iterative non-linear equation at each frequency. A > practical problem is that the odds of my rewriting the core code in > python in the near future are not good. > > I've made an include file for the core code from a version of fftpack.c > by stripping out the unused functions and making the survivors static. > Using this means, of course, that you'd be partly duplicating fftpack > from a binary bloat point of view but it leads to a plausible solution, > at least. > > I'm currently having (new) problems compiling on win7 64-bits > (msvcr90.dll for 64bits is missing and I haven't had any luck googling > it down yet) as well as on winxp 32-bits (my installation of mingw is > confused and can't find all of its parts). I'm planning to stick with > it a while longer and see if I can get past these bumps. If I succeed > I'll reraise the possibility of getting it into scipy. > > - martin I've gone on and gotten the package to work under both win32 and linux. I gather that if I want to proceed I should convert what I have into a scikit for submission. As best I can tell that involves (1) writing sphinx documentation, (2) using setuptools, and (3) assembling the bits into the proper directory structure for inclusion in scikits. I'm game to go on with this but I'm concerned that the effort will be futile for the reason that Ralf suggested: duplication of some fftpack code in my module. Does anyone have a clear idea whether or not this issue will be fatal for scikit acceptance? From pav at iki.fi Thu Jun 9 09:48:28 2011 From: pav at iki.fi (Pauli Virtanen) Date: Thu, 9 Jun 2011 13:48:28 +0000 (UTC) Subject: [Numpy-discussion] need help building a numpy extension that uses fftpack on windows References: Message-ID: Thu, 09 Jun 2011 09:37:36 -0400, martin smith wrote: [clip] > I'm game to go on with this but I'm concerned that the effort will be > futile for the reason that Ralf suggested: duplication of some fftpack > code in my module. Does anyone have a clear idea whether or not this > issue will be fatal for scikit acceptance? This is not a problem. There are no formal acceptance criteria for a scikit, nor a central authority making such decisions --- it's mostly a question of commonly agreed on naming conventions for packages. As long as your package has something to do with science, I'm sure nobody will object to its existence. 
Naming your package scikits.something may give it a small boost in visibility, but apart from that, it's just an ordinary Python package. -- Pauli Virtanen From mwwiebe at gmail.com Thu Jun 9 10:08:54 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 09:08:54 -0500 Subject: [Numpy-discussion] numpy c api general array handling In-Reply-To: <201106090839.p598ddCV010919@lotus.yokuts.org> References: <201106080758.p587wAP7032347@lotus.yokuts.org> <201106090839.p598ddCV010919@lotus.yokuts.org> Message-ID: On Thu, Jun 9, 2011 at 3:39 AM, Yoshi Rokuko wrote: > +-------------------------------------------- Mark Wiebe -----------+ > > > > I believe what you may want is PyArray_ContiguousFromAny instead of > > PyArray_ContiguousFromObject. > > > > are you sure that there is a difference? i think its just the new > name for the same thing right? > If you tried it and it behaves exactly the same, then I guess not. Using PyArray_FromAny with the appropriate flags would be the fallback approach in that case. -Mark > > best regards, yoshi > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Thu Jun 9 14:27:56 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 09 Jun 2011 11:27:56 -0700 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> Message-ID: <4DF110AC.5050903@noaa.gov> Mark Wiebe wrote: > Because datetime64 is a NumPy data type, it needs a well-defined rule > for these kinds of conversions. Treating datetimes as moments in time > instead of time intervals makes a very nice rule which appears to be > very amenable to a variety of computations, which is why I like the > approach. This really is the key issue that I've been harping on (sorry...) this whole thread: For many uses, a datetime as a moment in time is a great abstraction, and I think that's how most datetime implementations (like the std lib one) are used. However, when you are trying to represent/work with data like monthly averages and the like, you need something that represents something else -- and trying to use the same mechanism as for time instants, and hoping that the ambiguities will resolve themselves from the context, is dangerous. I don't work in finance, so I'm not sure about things like bi-monthly payments -- it seems those could well be defined as instants -- the payments are due on a given day each month (say the 1st and 15th), and I assume that is well defined to the instant -- i.e. before the end of the day in some time zone. (note that the time would then be 23:59.99999, rather than zero, however). The trick with these comes in when you do math -- the timedelta issue -- what is a 1 month timedelta? It's NOT a given number of days, hours, etc. I don't know that anyone has time to do this, but it seems a written up set of use-cases would help focus this conversation -- I know I've pretty much lost track of what uses we are trying to support. Another question: can you instantiate a datetime64 with something other than a string? i.e. a (year, [month], [day], [hour], [second], [usecond]) tuple? > The fact that it's a NumPy dtype probably is the biggest limiting > factor preventing parameters like 'start' and 'end' during conversion.
> Having a datetime represent an instant in time neatly removes any > ambiguity, so converting between days and seconds as a unit is > analogous to converting between int32 and float32. Sure, but I don't know that that is the best way to go -- integers are precisely defined and generally used as 3 == 3.00000000. That's not the case for months, at least if it's supposed to be a monthly average-type representation. This reminds me of a question recently on this list -- someone was using np.histogram() to bin integer values, and was surprised at the results -- what they needed to do was consider the bin intervals as floating point numbers to get what they wanted: 0.5, 1.5, 2.5, rather than 1,2,3,4, because what they really wanted was a categorical definition of an integer, NOT a truncated floating point number. I'm not sure how that informs this conversation, though... > > >>> np.timedelta64(10, 's') + 10 > > numpy.timedelta64(20,'s') > > Here, the unit is defined: 's' > > For the first operand, the inconsistency is with the second. Here's > the reasoning I didn't spell out: > We're adding a timedelta + int, so lets convert 10 into a timedelta. > No units specified, so it's > 10 microseconds, so we add 10 seconds and 10 microseconds, not 10 > seconds and 10 seconds. This sure seems ripe for error to me -- if a datetime and timedelta are going to be represented in various possible units, then I don't think it's a good idea to allow one to add an integer -- especially if the unit can be inferred from the input data, rather than specified. "Explicit is better than implicit." "In the face of ambiguity, refuse the temptation to guess." If you must allow this, then using the default for the unspecified unit as above is the way to go. Dave Hirschfeld wrote: >> Here are some current behaviors that are inconsistent with the microsecond > default, but consistent with the "generic time unit" idea: >>>>> np.timedelta64(10, 's') + 10 >> numpy.timedelta64(20,'s') > > That is what I would expect (and hope) would happen. IMO an integer should be > cast to the dtype ([s]) of the datetime/timedelta. This is way too ripe for error, particularly if we have the unit auto-determined from input data. Not to take us back to a probably already resolved issue, but maybe all this unit conversion could and should be avoided by following the python datetime approach -- all datetimes and timedeltas are always defined with microsecond precision -- period. Maybe there are computational efficiencies that we would be giving up. This would also preclude any use of these dtypes for work that required greater precision, but does anyone really need both year, month, day specification AND nanoseconds? Given all the leap-second issues, that seems a bit ridiculous. But it would make things easier. I note that in this entire conversation, all the talk has been about finance examples -- I think I'm the only one that has brought up science use, and that only barely (and mostly the simple cases). So do we really need to have the same dtype useful for finance and particle physics? -Chris -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From bbelson at princeton.edu Thu Jun 9 15:33:29 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Thu, 9 Jun 2011 15:33:29 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication Message-ID: Hello all, Since I haven't heard back, I decided to simplify my problem even more. I'm attaching two files which demonstrate the problem very clearly. My original message is below. The script can be run with "python shared_mem.py". Thanks, Brandt > > Hello, > > I'm parallelizing some code I've written using the built in > multiprocessing > > module. In my application, I need to multiply many large arrays together > > and > > sum the resulting product arrays (inner products). I noticed that when I > > parallelized this with myPool.map(...) with 8 processes (on an 8-core > > machine), the code was actually about an order of magnitude slower. I > > realized that this only happens when I use the numpy array > multiplication. > > I > > defined my own array multiplication and summation. As expected, this is > > much > > slower than the numpy versions, but when I parallelized it I saw the > > roughly > > 8x speedup I expected. So something about numpy array multiplication > > prevented me from speeding it up with multiprocessing. > > > > Is there a way to speed up numpy array multiplication with > multiprocessing? > > I noticed that numpy's SVD uses all of the available cores on its own, > does > > numpy array multiplication do something similar already? > > > > I'm copying the code which can reproduce and summarize these results. > > Simply > > run: "python shared_mem.py" with myutil.py in the same directory. > > > > Thanks, > > Brandt > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: shared_mem.py Type: application/octet-stream Size: 1522 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: myutil.py Type: application/octet-stream Size: 384 bytes Desc: not available URL: From mwwiebe at gmail.com Thu Jun 9 16:01:17 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 15:01:17 -0500 Subject: [Numpy-discussion] code review for datetime arange Message-ID: I've replaced the previous two pull requests with a single pull request rolling up all the changes so far. The newest changes include finishing the generic unit and np.arange function support. https://github.com/numpy/numpy/pull/87 Because of the nature of datetime and timedelta, arange has to be slightly different than with all the other types. In particular, for datetime the primary signature is np.arange(datetime, datetime, timedelta). I've implemented a simple extension which allows for another way to specify a date range, as np.arange(datetime, timedelta, timedelta). Here (start, delta) represents the datetime range [start, start+delta). 
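(To state that intended equivalence concretely -- this is only a sketch written against the behavior described above, not code taken from the pull request, and the particular dates and step are made up for illustration:)

import numpy as np

# The proposed np.arange(start, delta, step) form for datetimes is meant to
# behave like the familiar np.arange(start, start + delta, step).
start = np.datetime64('2011-06-09')
delta = np.timedelta64(10, 'D')
step = np.timedelta64(3, 'D')

# Expected result: 2011-06-09, 2011-06-12, 2011-06-15, 2011-06-18
print(np.arange(start, start + delta, step))
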
Some examples: >>> np.arange('2011', '2020', dtype='M8[Y]') array(['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019'], dtype='datetime64[Y]') >>> np.arange('today', 10, 3, dtype='M8') array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], dtype='datetime64[D]') >>> a = np.datetime64('today', 'M') >>> np.arange(a, a+1, dtype='M8[D]') array(['2011-06-01', '2011-06-02', '2011-06-03', '2011-06-04', '2011-06-05', '2011-06-06', '2011-06-07', '2011-06-08', '2011-06-09', '2011-06-10', '2011-06-11', '2011-06-12', '2011-06-13', '2011-06-14', '2011-06-15', '2011-06-16', '2011-06-17', '2011-06-18', '2011-06-19', '2011-06-20', '2011-06-21', '2011-06-22', '2011-06-23', '2011-06-24', '2011-06-25', '2011-06-26', '2011-06-27', '2011-06-28', '2011-06-29', '2011-06-30'], dtype='datetime64[D]') Please build the branch available here, https://github.com/m-paradox/numpy try it out, and then discuss! -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 9 16:06:45 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 15:06:45 -0500 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: <25E47041-5FA9-4B77-A315-004DE1FCC12E@gmail.com> References: <75C12E2F-C2B7-4C61-BC2D-EE0AAE1CD7DF@gmail.com> <25E47041-5FA9-4B77-A315-004DE1FCC12E@gmail.com> Message-ID: On Thu, Jun 9, 2011 at 1:38 AM, Pierre GM wrote: > > On Jun 9, 2011, at 2:22 AM, Mark Wiebe wrote: > > > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > > > > array(['2011-03-12T13:00:00.000000-0600', > '2011-12-31T18:00:00.000000-0600'], dtype='datetime64[us]') > > > > > > Why is the second one not '2012-01-01T00:00:00-0600' ? > > > > > > This is because dates are stored at midnight UTC, and when converted to > local time for the default time-based printing, that changes slightly. > > > ISO8601 specifies to interpret an input in local time if no "Z" or > timezone offset is given, so that's why the first one matches. I haven't > been able to think of a way around it other than putting warnings in the > documentation, and have made 'today' and 'now' throw errors if you try to > use them as times or dates respectively. > > > > I see the logic, but I don't like it at all. I would expect the date to > be stored in the local time zone by default (that is, if no other time zone > info is available). > > > > It's not satisfying to me either, but I haven't been able to think of a > solution I like. The idea of converting from 'D' to 's' metadata depending > on the timezone setting of your computer feels worse to me than the current > approach, but if someone has an idea that's better I'm all ears. > > A simpler approach could be to drop ISO8601 recommendation and assume that > any single datetime is expressed in UTC by default. Managing time zones > would then be let to some specific subclass... Doing nothing with time zones would mean this prints by default: >>> np.datetime_as_string(np.datetime64('now')) '2011-06-09T20:03:20Z' instead of this: >>> np.datetime64('now') numpy.datetime64('2011-06-09T15:03:28-0500','s') The issue above is the relationship between dates and datetimes, which the NumPy datetime64 has in a continuum. From a UTC perspective, this is very natural, but it does cause a few hiccups. I'll make a thread at some point for discussing the time zone issues. -Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Thu Jun 9 16:08:55 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 15:08:55 -0500 Subject: [Numpy-discussion] Default unit for datetime/timedelta In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 4:17 AM, Dave Hirschfeld wrote: > Mark Wiebe gmail.com> writes: > > > > > Here are some current behaviors that are inconsistent with the > microsecond > default, but consistent with the "generic time unit" idea: > > > > >>> np.timedelta64(10, 's') + 10 > > numpy.timedelta64(20,'s') > > > > > > That is what I would expect (and hope) would happen. IMO an integer should > be > cast to the dtype ([s]) of the datetime/timedelta. > > > > > >>> np.array(['2011-03-12T13', '2012'], dtype='M8') > > array(['2011-03-12T13:00:00.000000-0600', > '2011-12-31T18:00:00.000000-0600'], > dtype='datetime64[us]') > > > > I would expect the second value of the array to be midnight at the end of > the > year. I'm not sure what is actually happening in the above example. What > happens > if I email that code to someone in New Zealand would they get a different > array?? They will get exactly the same result as you do under the hood for the second one, but it will print differently with the local timezone printing convention. With the timezone encoded in the string as it is, that string does represent midnight at the start of the year in UTC. To get the same result for the first one, you would specify '2011-03-12T13Z', indicating the time is in UTC instead of local time. -Mark > > -Dave > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Thu Jun 9 16:11:40 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 09 Jun 2011 13:11:40 -0700 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: <4DF128FC.8000807@noaa.gov> Not much time, here, but since you got no replies earlier: > > I'm parallelizing some code I've written using the built in > multiprocessing > > module. In my application, I need to multiply many large arrays > together is the matrix multiplication, or element-wise? If matrix, then numpy should be using LAPACK, which, depending on how its built, could be using all your cores already. This is heavily dependent on your your numpy (really the LAPACK it uses0 is built. > > and > > sum the resulting product arrays (inner products). are you using numpy.dot() for that? If so, then the above applies to that as well. I know I could look at your code to answer these questions, but I thought this might help. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From wesmckinn at gmail.com Thu Jun 9 16:17:48 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 9 Jun 2011 16:17:48 -0400 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: On Wed, Jun 8, 2011 at 8:53 PM, Mark Wiebe wrote: > On Wed, Jun 8, 2011 at 4:57 AM, Wes McKinney wrote: >> >> >> >> >> So in summary, w.r.t. 
time series data and datetime, the only things I >> care about from a datetime / pandas point of view: >> >> - Ability to easily define custom timedeltas > > Can you elaborate on this a bit? I'm guessing you're not referring to the > timedelta64 in NumPy, which is simply an integer with an associated unit. I guess what I am thinking of may not need to be available at the NumPy level. As an example, suppose you connected a timedelta (DateOffset, in pandas parlance) to a set of holidays, so when you say: date + BusinessDay(1, calendar='US') then you have some custom logic which knows how to describe the result of that. Some things along these lines have already been discussed-- but this is the basic way that the date offsets work in pandas, i.e. subclasses of DateOffset which implement custom logic. Probably too high level for NumPy. >> >> - Generate datetime objects, or some equivalent, which can be used to >> back pandas data structures > > Do you mean mechanisms to generate sequences of datetime's? I'm fixing up > arange to work reasonably with datetimes at the moment. Yes, that will be very nice. >> >> - (possible now??) Ability to have a set of frequency-naive dates >> (possibly not in order). > > This should work just as well as if you have an arbitrary set of integers > specifying the locations of the sample points. > Cheers, > Mark Cool (!). >> >> This last point actually matters. Suppose you wanted to get the worst >> 5-performing days in the S&P 500 index: >> >> In [7]: spx.index >> Out[7]: >> >> offset: <1 BusinessDay>, tzinfo: None >> [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] >> length: 2963 >> >> # but this is OK >> In [8]: spx.order()[:5] >> Out[8]: >> 2008-10-15 00:00:00 ? ?-0.0903497960942 >> 2008-12-01 00:00:00 ? ?-0.0892952780505 >> 2008-09-29 00:00:00 ? ?-0.0878970494885 >> 2008-10-09 00:00:00 ? ?-0.0761670761671 >> 2008-11-20 00:00:00 ? ?-0.0671229140321 >> >> - W >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From Chris.Barker at noaa.gov Thu Jun 9 16:41:40 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 09 Jun 2011 13:41:40 -0700 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: Message-ID: <4DF13004.5090304@noaa.gov> Mark Wiebe wrote: > Because of the nature of datetime and timedelta, arange has to be > slightly different than with all the other types. In particular, for > datetime the primary signature is np.arange(datetime, datetime, timedelta). > > I've implemented a simple extension which allows for another way to > specify a date range, as np.arange(datetime, timedelta, timedelta). instead of, or in addition to, the above? it seems you can pass in the following types: strings np.datetime64 np.timedelta64 integers (floats ?) Are you essentially doing method overloading to determine what it all means? How do you know if: np.arange('2011', '2020', dtype='M8[Y]') means you want from the years 2011 to 2020 or from 2011 to 4031? > >>> np.arange('today', 10, 3, dtype='M8') > array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], > dtype='datetime64[D]') so dtype 'M8' defaults to increments of days? 
of course, I've lost track of the difference between 'M' and 'M8' (I've never liked the dtype code anyway -- I far prefer np.float64 to 'd', for instance) Will there be a "linspace" for datetimes? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Thu Jun 9 16:58:57 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 15:58:57 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: <4DF13004.5090304@noaa.gov> References: <4DF13004.5090304@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker wrote: > Mark Wiebe wrote: > > Because of the nature of datetime and timedelta, arange has to be > > slightly different than with all the other types. In particular, for > > datetime the primary signature is np.arange(datetime, datetime, > timedelta). > > > > I've implemented a simple extension which allows for another way to > > specify a date range, as np.arange(datetime, timedelta, timedelta). > > instead of, or in addition to, the above? > In addition to. > it seems you can pass in the following types: > > strings > np.datetime64 > np.timedelta64 > integers > (floats ?) > It's like pretty much everywhere in NumPy, where conversion to the target type is done when needed. Are you essentially doing method overloading to determine what it all means? > That's right, if any of the start, stop, or step are datetime-related types (including python datetime types), or the dtype provided is datetime/timedelta, it calls the datetime-specific arange function. How do you know if: > > np.arange('2011', '2020', dtype='M8[Y]') > Only timedeltas and integers result in the (datetime, timedelta) mode. Specifying an integer directly as a datetime is itself a very odd thing, as a timedelta it makes much more sense. means you want from the years 2011 to 2020 or from 2011 to 4031? > >>> np.arange('2011', '2020', dtype='M8[Y]') array(['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019'], dtype='datetime64[Y]') > >>> np.arange('today', 10, 3, dtype='M8') > > array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], > > dtype='datetime64[D]') > > so dtype 'M8' defaults to increments of days? > No, 'M8' means generic/unspecified units, and 'today' defaults to units of days. The generic units are similar to how the string dtype doesn't specify a size. of course, I've lost track of the difference between 'M' and 'M8' > > (I've never liked the dtype code anyway -- I far prefer np.float64 to > 'd', for instance) > > Will there be a "linspace" for datetimes? > That can certainly be done, yes. -Mark -Chris > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at googlemail.com Thu Jun 9 17:27:19 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Thu, 9 Jun 2011 23:27:19 +0200 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe wrote: > On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker > wrote: > > Your branch works fine for me (OS X, py2.6), no failures. Only a few deprecation warnings like: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: DeprecationWarning: DType strings 'O4' and 'O8' are deprecated because they are platform specific. Use 'O' instead callableObj(*args, **kwargs) > Mark Wiebe wrote: >> > Because of the nature of datetime and timedelta, arange has to be >> > slightly different than with all the other types. In particular, for >> > datetime the primary signature is np.arange(datetime, datetime, >> timedelta). >> > >> > I've implemented a simple extension which allows for another way to >> > specify a date range, as np.arange(datetime, timedelta, timedelta). >> >> Did you think about how to document which of these basic functions work with datetime? I don't think that belongs in the docstrings, but it may then be hard for the user to figure out which functions accept datetimes. And there will be no usage examples in the docstrings. Besides docs, I am not sure about your choice to modify functions like arange instead of writing a module of wrapper functions for them that know what to do with the dtype. If you have a module you can group all relevant functions, so they're easy to find. Plus it's more future-proof - if at some point numpy grows another new dtype, just create a new module with wrapper funcs for that dtype. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 9 17:27:38 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 9 Jun 2011 16:27:38 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 15:01, Mark Wiebe wrote: > I've replaced the previous two pull requests with a single pull request > rolling up all the changes so far. The newest changes include finishing the > generic unit and np.arange function support. > https://github.com/numpy/numpy/pull/87 > Because of the nature of datetime and timedelta, arange has to be slightly > different than with all the other types. In particular, for datetime the > primary signature is np.arange(datetime, datetime, timedelta). > I've implemented a simple extension which allows for another way to specify > a date range, as np.arange(datetime, timedelta, timedelta). Here (start, > delta) represents the datetime range [start, start+delta). Some examples: >>>> np.arange('2011', '2020', dtype='M8[Y]') > array(['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', > ? ? ? ?'2019'], dtype='datetime64[Y]') >>>> np.arange('today', 10, 3, dtype='M8') > array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], > dtype='datetime64[D]') I would prefer that we not further overload the signature of np.arange() for this case. A new function dtrange() that can take a delta would be preferable. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? 
-- Umberto Eco From rowen at uw.edu Thu Jun 9 17:46:49 2011 From: rowen at uw.edu (Russell E. Owen) Date: Thu, 09 Jun 2011 14:46:49 -0700 Subject: [Numpy-discussion] numpy build: automatic fortran detection Message-ID: What would it take to automatically detect which flavor of fortran to use to build numpy on linux? The unit tests are clever enough to detect a mis-build (though surprisingly that is not done as part of the build process), so surely it can be done. Even if there is no interest in putting this into numpy's setup.py, we have a need for it for our own system. Any suggestions on how to do this robustly would be much appreciated. -- Russell From mwwiebe at gmail.com Thu Jun 9 17:54:43 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 16:54:43 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 4:27 PM, Ralf Gommers wrote: > > > On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe wrote: > >> On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker > > wrote: >> >> Your branch works fine for me (OS X, py2.6), no failures. Only a few > deprecation warnings like: > /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: > DeprecationWarning: DType strings 'O4' and 'O8' are deprecated because they > are platform specific. Use 'O' instead > callableObj(*args, **kwargs) > It looks like there are some '|O4' dtypes in 'lib/tests/test_format.py', testing the .npy file format. I'm not sure why I'm not getting this warning though. > Mark Wiebe wrote: >>> > Because of the nature of datetime and timedelta, arange has to be >>> > slightly different than with all the other types. In particular, for >>> > datetime the primary signature is np.arange(datetime, datetime, >>> timedelta). >>> > >>> > I've implemented a simple extension which allows for another way to >>> > specify a date range, as np.arange(datetime, timedelta, timedelta). >>> >>> Did you think about how to document which of these basic functions work > with datetime? I don't think that belongs in the docstrings, but it may then > be hard for the user to figure out which functions accept datetimes. And > there will be no usage examples in the docstrings. > I think documenting it in a 'datetime' section of the arange documentation would be reasonable. The main datetime documentation page would also mention the functions that are most useful. Besides docs, I am not sure about your choice to modify functions like > arange instead of writing a module of wrapper functions for them that know > what to do with the dtype. If you have a module you can group all relevant > functions, so they're easy to find. Plus it's more future-proof - if at some > point numpy grows another new dtype, just create a new module with wrapper > funcs for that dtype. > The facts that datetime and timedelta are related in a particular way different from other data types, and that they are parameterized types, both contribute to them not fitting naturally the current structure of NumPy. I'm not sure I understand the module idea, I would rather think that since it's a built-in NumPy data type, it should work with the regular NumPy functions wherever that makes sense. The implementation can also be changed in the future without affecting the behavior of the function, since it's the semantics which are most important. The experience hooking in datetime and timedelta can guide future abstraction to more general dtype mechanisms. 
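(To make that concrete, here is a minimal sketch -- assuming the datetime support in the branch under review, and not code taken from it -- of what "working with the regular NumPy functions" looks like for this dtype; the dates are made up for illustration:)

import numpy as np

# A plain ndarray with a datetime64 dtype, in no particular order.
dates = np.array(['2011-06-15', '2011-06-01', '2011-06-09'], dtype='M8[D]')

# Ordinary NumPy operations apply directly to the datetime dtype.
print(np.sort(dates))                 # chronological order
print(dates.max() - dates.min())      # difference comes back as a timedelta64

# arange with a datetime start/stop and a timedelta step.
print(np.arange(dates.min(), dates.max(), np.timedelta64(7, 'D')))
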
-Mark > > Cheers, > Ralf > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 9 17:55:07 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 9 Jun 2011 15:55:07 -0600 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 3:27 PM, Robert Kern wrote: > On Thu, Jun 9, 2011 at 15:01, Mark Wiebe wrote: > > I've replaced the previous two pull requests with a single pull request > > rolling up all the changes so far. The newest changes include finishing > the > > generic unit and np.arange function support. > > https://github.com/numpy/numpy/pull/87 > > Because of the nature of datetime and timedelta, arange has to be > slightly > > different than with all the other types. In particular, for datetime the > > primary signature is np.arange(datetime, datetime, timedelta). > > I've implemented a simple extension which allows for another way to > specify > > a date range, as np.arange(datetime, timedelta, timedelta). Here (start, > > delta) represents the datetime range [start, start+delta). Some examples: > >>>> np.arange('2011', '2020', dtype='M8[Y]') > > array(['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', > > '2019'], dtype='datetime64[Y]') > >>>> np.arange('today', 10, 3, dtype='M8') > > array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], > > dtype='datetime64[D]') > > I would prefer that we not further overload the signature of > np.arange() for this case. A new function dtrange() that can take a > delta would be preferable. > > I wonder if it would be better to have separate functions for the normal range and (later) linspace functions. Dates are sufficiently different from the usual run of numeric types that it might be better to have functions of similar functionality but different names so that it is always clear what is being asked for. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Thu Jun 9 18:21:49 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Fri, 10 Jun 2011 00:21:49 +0200 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 11:54 PM, Mark Wiebe wrote: > On Thu, Jun 9, 2011 at 4:27 PM, Ralf Gommers wrote: > >> >> >> On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe wrote: >> >>> On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker < >>> Chris.Barker at noaa.gov> wrote: >>> >>> Your branch works fine for me (OS X, py2.6), no failures. Only a few >> deprecation warnings like: >> /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: >> DeprecationWarning: DType strings 'O4' and 'O8' are deprecated because they >> are platform specific. Use 'O' instead >> callableObj(*args, **kwargs) >> > > It looks like there are some '|O4' dtypes in 'lib/tests/test_format.py', > testing the .npy file format. I'm not sure why I'm not getting this warning > though. > > >> Mark Wiebe wrote: >>>> > Because of the nature of datetime and timedelta, arange has to be >>>> > slightly different than with all the other types. In particular, for >>>> > datetime the primary signature is np.arange(datetime, datetime, >>>> timedelta). 
>>>> > >>>> > I've implemented a simple extension which allows for another way to >>>> > specify a date range, as np.arange(datetime, timedelta, timedelta). >>>> >>>> Did you think about how to document which of these basic functions work >> with datetime? I don't think that belongs in the docstrings, but it may then >> be hard for the user to figure out which functions accept datetimes. And >> there will be no usage examples in the docstrings. >> > > I think documenting it in a 'datetime' section of the arange documentation > would be reasonable. The main datetime documentation page would also mention > the functions that are most useful. > > Besides docs, I am not sure about your choice to modify functions like >> arange instead of writing a module of wrapper functions for them that know >> what to do with the dtype. If you have a module you can group all relevant >> functions, so they're easy to find. Plus it's more future-proof - if at some >> point numpy grows another new dtype, just create a new module with wrapper >> funcs for that dtype. >> > > The facts that datetime and timedelta are related in a particular way > different from other data types, and that they are parameterized types, both > contribute to them not fitting naturally the current structure of NumPy. I'm > not sure I understand the module idea, > Basically, use np.datetime.arange which understand the dtype, then calls np.arange under the hood. Or is just its own function, like the dtrange() Robert just suggested. It's pretty much the same as for the ma module, which reimplements or wraps many numpy functions that do not understand masked arrays. > I would rather think that since it's a built-in NumPy data type, it should > work with the regular NumPy functions wherever that makes sense. > That doesn't make sense to me. Being a dtype that happens to be shipped with numpy doesn't make it more special than other dtypes. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 9 18:23:46 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 17:23:46 -0500 Subject: [Numpy-discussion] datetime business day API Message-ID: Here's a possible design for a business day API for numpy datetimes: The 'B' business day unit will be removed. All business day-related calculations will be done using the 'D' day unit. A class *BusinessDayDef* to encapsulate the definition of the business week and holidays. The business day functions will either take one of these objects, or separate weekmask and holidays parameters, to specify the business day definition. This class serves as both a performance optimization and a way to encapsulate the weekmask and holidays together, for example if you want to make a dictionary mapping exchange names to their trading days definition. The weekmask can be specified in a number of ways, and internally becomes a boolean array with 7 elements with True for the days Monday through Sunday which are valid business days. Some different notations are for the 5-day week include [1,1,1,1,1,0,0], "1111100" "MonTueWedThuFri". The holidays are always specified as a one-dimensional array of dtype 'M8[D]', and are internally used in sorted form. A function *is_busday*(datearray, weekmask=, holidays=, busdaydef=) returns a boolean array matching the input datearray, with True for the valid business days. 
A function *busday_offset*(datearray, offsetarray, roll='raise', weekmask=, holidays=, busdaydef=) which first applies the 'roll' policy to start at a valid business date, then offsets the date by the number of business days specified in offsetarray. The arrays datearray and offsetarray are broadcast together. The 'roll' parameter can be 'forward'/'following', 'backward'/'preceding', 'modifiedfollowing', 'modifiedpreceding', or 'raise' (the default). A function *busday_count*(datearray1, datearray2, weekmask=, holidays=, busdaydef=) which calculates the number of business days between datearray1 and datearray2, not including the day of datearray2. For example, to find the first Monday in Feb 2011, >>>np.busday_offset('2011-02', 0, roll='forward', weekmask='Mon') or to find the number of weekdays in Feb 2011, >>>np.busday_count('2011-02', '2011-03') This set of three functions appears to be powerful enough to express the business-day computations that I've been shown thus far. Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 9 19:28:26 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 9 Jun 2011 18:28:26 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 16:27, Robert Kern wrote: > On Thu, Jun 9, 2011 at 15:01, Mark Wiebe wrote: >> I've replaced the previous two pull requests with a single pull request >> rolling up all the changes so far. The newest changes include finishing the >> generic unit and np.arange function support. >> https://github.com/numpy/numpy/pull/87 >> Because of the nature of datetime and timedelta, arange has to be slightly >> different than with all the other types. In particular, for datetime the >> primary signature is np.arange(datetime, datetime, timedelta). >> I've implemented a simple extension which allows for another way to specify >> a date range, as np.arange(datetime, timedelta, timedelta). Here (start, >> delta) represents the datetime range [start, start+delta). Some examples: >>>>> np.arange('2011', '2020', dtype='M8[Y]') >> array(['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', >> ? ? ? ?'2019'], dtype='datetime64[Y]') >>>>> np.arange('today', 10, 3, dtype='M8') >> array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], >> dtype='datetime64[D]') > > I would prefer that we not further overload the signature of > np.arange() for this case. A new function dtrange() that can take a > delta would be preferable. Alternately, a general np.deltarange(start, delta[, step]) function might be useful, too. I know I've done the following quite a few times, even with just integers: np.arange(start, start+delta) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From mwwiebe at gmail.com Thu Jun 9 19:44:53 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 18:44:53 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 6:28 PM, Robert Kern wrote: > On Thu, Jun 9, 2011 at 16:27, Robert Kern wrote: > > On Thu, Jun 9, 2011 at 15:01, Mark Wiebe wrote: > >> I've replaced the previous two pull requests with a single pull request > >> rolling up all the changes so far. 
The newest changes include finishing > the > >> generic unit and np.arange function support. > >> https://github.com/numpy/numpy/pull/87 > >> Because of the nature of datetime and timedelta, arange has to be > slightly > >> different than with all the other types. In particular, for datetime the > >> primary signature is np.arange(datetime, datetime, timedelta). > >> I've implemented a simple extension which allows for another way to > specify > >> a date range, as np.arange(datetime, timedelta, timedelta). Here (start, > >> delta) represents the datetime range [start, start+delta). Some > examples: > >>>>> np.arange('2011', '2020', dtype='M8[Y]') > >> array(['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', > >> '2019'], dtype='datetime64[Y]') > >>>>> np.arange('today', 10, 3, dtype='M8') > >> array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], > >> dtype='datetime64[D]') > > > > I would prefer that we not further overload the signature of > > np.arange() for this case. A new function dtrange() that can take a > > delta would be preferable. > > Alternately, a general np.deltarange(start, delta[, step]) +1 function > might be useful, too. I know I've done the following quite a few > times, even with just integers: > > np.arange(start, start+delta) > I like this approach. -Mark > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless > enigma that is made terrible by our own mad attempt to interpret it as > though it had an underlying truth." > -- Umberto Eco > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 9 19:54:51 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 9 Jun 2011 18:54:51 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 5:21 PM, Ralf Gommers wrote: > > > On Thu, Jun 9, 2011 at 11:54 PM, Mark Wiebe wrote: > >> On Thu, Jun 9, 2011 at 4:27 PM, Ralf Gommers > > wrote: >> >>> >>> >>> On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe wrote: >>> >>>> On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker < >>>> Chris.Barker at noaa.gov> wrote: >>>> >>>> Your branch works fine for me (OS X, py2.6), no failures. Only a few >>> deprecation warnings like: >>> /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: >>> DeprecationWarning: DType strings 'O4' and 'O8' are deprecated because they >>> are platform specific. Use 'O' instead >>> callableObj(*args, **kwargs) >>> >> >> It looks like there are some '|O4' dtypes in 'lib/tests/test_format.py', >> testing the .npy file format. I'm not sure why I'm not getting this warning >> though. >> >> >>> Mark Wiebe wrote: >>>>> > Because of the nature of datetime and timedelta, arange has to be >>>>> > slightly different than with all the other types. In particular, for >>>>> > datetime the primary signature is np.arange(datetime, datetime, >>>>> timedelta). >>>>> > >>>>> > I've implemented a simple extension which allows for another way to >>>>> > specify a date range, as np.arange(datetime, timedelta, timedelta). >>>>> >>>>> Did you think about how to document which of these basic functions work >>> with datetime? 
I don't think that belongs in the docstrings, but it may then >>> be hard for the user to figure out which functions accept datetimes. And >>> there will be no usage examples in the docstrings. >>> >> >> I think documenting it in a 'datetime' section of the arange >> documentation would be reasonable. The main datetime documentation page >> would also mention the functions that are most useful. >> >> Besides docs, I am not sure about your choice to modify functions like >>> arange instead of writing a module of wrapper functions for them that know >>> what to do with the dtype. If you have a module you can group all relevant >>> functions, so they're easy to find. Plus it's more future-proof - if at some >>> point numpy grows another new dtype, just create a new module with wrapper >>> funcs for that dtype. >>> >> >> The facts that datetime and timedelta are related in a particular way >> different from other data types, and that they are parameterized types, both >> contribute to them not fitting naturally the current structure of NumPy. I'm >> not sure I understand the module idea, >> > > Basically, use np.datetime.arange which understand the dtype, then calls > np.arange under the hood. Or is just its own function, like the dtrange() > Robert just suggested. It's pretty much the same as for the ma module, which > reimplements or wraps many numpy functions that do not understand masked > arrays. > I'm not a big fan of the way the ma module works, it doesn't integrate naturally and orthogonally with all the other features of NumPy. It's also an array subtype, quite different from a dtype. We don't have np.bool.arange, np.int8.arange, etc, and the abstraction used by arange built into the custom data type mechanism is too weak too support the needs of datetime. I'd like to use the requirements of datetime as a guide to molding the future design of the data type system, and if we make datetime a second-class citizen because it doesn't behave like a float, we're not going to be able to discover the possibilities. > I would rather think that since it's a built-in NumPy data type, it should >> work with the regular NumPy functions wherever that makes sense. >> > > That doesn't make sense to me. Being a dtype that happens to be shipped > with numpy doesn't make it more special than other dtypes. > This isn't making it more special, it's just conforming the natural NumPy way to how datetime/timedelta operates. If overloading the 'stop' parameter to support a relative stop based on the type system is going a step too far, that's easy to pull back, and I support Robert's idea of adding the deltarange function. Cheers, Mark > > Ralf > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.anton.letnes at gmail.com Thu Jun 9 20:25:45 2011 From: paul.anton.letnes at gmail.com (Paul Anton Letnes) Date: Thu, 9 Jun 2011 17:25:45 -0700 Subject: [Numpy-discussion] silly user wish Message-ID: Dear numpy developers, users, and fans: I wish for a numpy that will magically detect all my errors. Barring the invention of a numpy version which knows what I need, not what I ask of it, I have the following, simpler wish: if one takes myarray.real or myarray.imag of a float (or possibly integer), give a warning saying "well that didn't make much sense, did it?". 
(Maybe there is some good reason why this is a bad idea, but the idea has been presented to you, nonetheless.) Best regards Paul. From robert.kern at gmail.com Thu Jun 9 22:07:14 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 9 Jun 2011 21:07:14 -0500 Subject: [Numpy-discussion] silly user wish In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 19:25, Paul Anton Letnes wrote: > Dear numpy developers, users, and fans: > > I wish for a numpy that will magically detect all my errors. Barring the invention of a numpy version which knows what I need, not what I ask of it, I have the following, simpler wish: if one takes myarray.real or myarray.imag of a float (or possibly integer), give a warning saying "well that didn't make much sense, did it?". > > (Maybe there is some good reason why this is a bad idea, but the idea has been presented to you, nonetheless.) Those attributes were explicitly added to all arrays (and numpy scalars!) because it makes it easier to write generic complex algorithms that will work correctly regardless of the dtype of the input without dispatching to different code paths. There's a reasonably large portion of complex algorithms that can be correctly implemented for real arrays simply by implicitly assuming that the imaginary component is all 0s. If you happen to have an algorithm where passing a float array is more likely an indicator of an error, you can do the check yourself. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From ralf.gommers at googlemail.com Fri Jun 10 01:56:01 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Fri, 10 Jun 2011 07:56:01 +0200 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Fri, Jun 10, 2011 at 1:54 AM, Mark Wiebe wrote: > On Thu, Jun 9, 2011 at 5:21 PM, Ralf Gommers wrote: > >> >> >> On Thu, Jun 9, 2011 at 11:54 PM, Mark Wiebe wrote: >> >>> On Thu, Jun 9, 2011 at 4:27 PM, Ralf Gommers < >>> ralf.gommers at googlemail.com> wrote: >>> >>>> >>>> >>>> On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe wrote: >>>> >>>>> On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker < >>>>> Chris.Barker at noaa.gov> wrote: >>>>> >>>>> Your branch works fine for me (OS X, py2.6), no failures. Only a few >>>> deprecation warnings like: >>>> /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: >>>> DeprecationWarning: DType strings 'O4' and 'O8' are deprecated because they >>>> are platform specific. Use 'O' instead >>>> callableObj(*args, **kwargs) >>>> >>> >>> It looks like there are some '|O4' dtypes in 'lib/tests/test_format.py', >>> testing the .npy file format. I'm not sure why I'm not getting this warning >>> though. >>> >>> >>>> Mark Wiebe wrote: >>>>>> > Because of the nature of datetime and timedelta, arange has to be >>>>>> > slightly different than with all the other types. In particular, for >>>>>> > datetime the primary signature is np.arange(datetime, datetime, >>>>>> timedelta). >>>>>> > >>>>>> > I've implemented a simple extension which allows for another way to >>>>>> > specify a date range, as np.arange(datetime, timedelta, timedelta). >>>>>> >>>>>> Did you think about how to document which of these basic functions >>>> work with datetime? 
I don't think that belongs in the docstrings, but it may >>>> then be hard for the user to figure out which functions accept datetimes. >>>> And there will be no usage examples in the docstrings. >>>> >>> >>> I think documenting it in a 'datetime' section of the arange >>> documentation would be reasonable. The main datetime documentation page >>> would also mention the functions that are most useful. >>> >>> Besides docs, I am not sure about your choice to modify functions like >>>> arange instead of writing a module of wrapper functions for them that know >>>> what to do with the dtype. If you have a module you can group all relevant >>>> functions, so they're easy to find. Plus it's more future-proof - if at some >>>> point numpy grows another new dtype, just create a new module with wrapper >>>> funcs for that dtype. >>>> >>> >>> The facts that datetime and timedelta are related in a particular way >>> different from other data types, and that they are parameterized types, both >>> contribute to them not fitting naturally the current structure of NumPy. I'm >>> not sure I understand the module idea, >>> >> >> Basically, use np.datetime.arange which understand the dtype, then calls >> np.arange under the hood. Or is just its own function, like the dtrange() >> Robert just suggested. It's pretty much the same as for the ma module, which >> reimplements or wraps many numpy functions that do not understand masked >> arrays. >> > > I'm not a big fan of the way the ma module works, it doesn't integrate > naturally and orthogonally with all the other features of NumPy. It's also > an array subtype, quite different from a dtype. We don't have > np.bool.arange, np.int8.arange, etc, and the abstraction used by arange > built into the custom data type mechanism is too weak too support the needs > of datetime. > > I'd like to use the requirements of datetime as a guide to molding the > future design of the data type system, and if we make datetime a > second-class citizen because it doesn't behave like a float, we're not going > to be able to discover the possibilities. > > I would rather think that since it's a built-in NumPy data type, it should >>> work with the regular NumPy functions wherever that makes sense. >>> >> >> That doesn't make sense to me. Being a dtype that happens to be shipped >> with numpy doesn't make it more special than other dtypes. >> > > This isn't making it more special, it's just conforming the natural NumPy > way to how datetime/timedelta operates. > Maybe I'm misunderstanding this, and once you make a function work for datetime it would also work for other new dtypes. But my impression is that that's not the case. Let's say I make a new dtype with distance instead of time attached. Would I be able to use it with arange, or would I have to go in and change the arange implementation again to support it? Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonas.ruebsam at web.de Fri Jun 10 08:24:28 2011 From: jonas.ruebsam at web.de (jonasr) Date: Fri, 10 Jun 2011 05:24:28 -0700 (PDT) Subject: [Numpy-discussion] numpy function error Message-ID: <31817481.post@talk.nabble.com> Hello, i have the following problem, the following code doesnt work def f1(x): return self.lgdata[2*i][0]*float(x)+self.lgdata[2*i][1] def f2(x): return self.lgdata[2*i+1][0]*float(x)+self.lgdata[2*i+1][1] f1=f1(self.ptdata[i]) 2=f2(self.ptdata[i]) t=abs(f1-f2) deltat.append(t) temptot.append(f1) rezipr.append(1/t) xul , xur = self.ptdata[i]-0.001, self.ptdata[i]+0.001 print xul,xur,f1,f2,"test \n" verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] it gives me the following error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/matplotlib/backends/backend_gtk.py", line 265, in key_press_event FigureCanvasBase.key_press_event(self, key, guiEvent=event) File "/usr/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 1523, in key_press_event self.callbacks.process(s, event) File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 265, in process proxy(*args, **kwargs) File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 191, in __call__ return mtd(*args, **kwargs) File "auswertung.py", line 103, in __call__ verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] TypeError: 'numpy.float64' object is not callable have no idea where the problem is, since ptdata is an 1dim array there should be no problem to pass it to a function ? -- View this message in context: http://old.nabble.com/numpy-function-error-tp31817481p31817481.html Sent from the Numpy-discussion mailing list archive at Nabble.com. From jonas.ruebsam at web.de Fri Jun 10 08:35:26 2011 From: jonas.ruebsam at web.de (jonasr) Date: Fri, 10 Jun 2011 05:35:26 -0700 (PDT) Subject: [Numpy-discussion] numpy function error Message-ID: <31817562.post@talk.nabble.com> Hello, i have the following problem, the following code doesnt work def f1(x): return self.lgdata[2*i][0]*float(x)+self.lgdata[2*i][1] def f2(x): return self.lgdata[2*i+1][0]*float(x)+self.lgdata[2*i+1][1] f1=f1(self.ptdata[i]) f2=f2(self.ptdata[i]) t=abs(f1-f2) deltat.append(t) temptot.append(f1) rezipr.append(1/t) xul , xur = self.ptdata[i]-0.001, self.ptdata[i]+0.001 print xul,xur,f1,f2,"test \n" verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] it gives me the following error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/matplotlib/backends/backend_gtk.py", line 265, in key_press_event FigureCanvasBase.key_press_event(self, key, guiEvent=event) File "/usr/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 1523, in key_press_event self.callbacks.process(s, event) File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 265, in process proxy(*args, **kwargs) File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 191, in __call__ return mtd(*args, **kwargs) File "auswertung.py", line 103, in __call__ verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] TypeError: 'numpy.float64' object is not callable have no idea where the problem is, since ptdata is an 1dim array there should be no problem to pass it to a function ? -- View this message in context: http://old.nabble.com/numpy-function-error-tp31817562p31817562.html Sent from the Numpy-discussion mailing list archive at Nabble.com. 
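(For reference, a minimal self-contained version of what goes wrong above and one way to avoid it; the slope and intercept numbers here are made up purely for illustration. As the reply below explains, the root cause is that "f1 = f1(self.ptdata[i])" rebinds the name f1 from a function to a numpy.float64, so the later call f1(xul) raises the TypeError:)

import numpy as np

def f1(x):
    return 2.0 * float(x) + 1.0    # placeholder for the lgdata[2*i] slope/intercept

def f2(x):
    return 3.0 * float(x) - 0.5    # placeholder for the lgdata[2*i+1] slope/intercept

x0 = np.float64(1.25)              # stands in for self.ptdata[i]

# Store the evaluated values under different names instead of reusing f1/f2.
y1 = f1(x0)
y2 = f2(x0)
t = abs(y1 - y2)

xul, xur = x0 - 0.001, x0 + 0.001
# f1 and f2 are still functions here, so these calls succeed.
verts = [(xul, f1(xul)), (xul, f2(xul)), (xur, f2(xur)), (xur, f1(xur))]
print(t, verts)
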
From shish at keba.be Fri Jun 10 08:48:20 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 08:48:20 -0400 Subject: [Numpy-discussion] numpy function error In-Reply-To: <31817481.post@talk.nabble.com> References: <31817481.post@talk.nabble.com> Message-ID: You are overriding your f1 function with a float (with "f1=f1(self.ptdata[i])"), so trying to call f1(xul) later will raise this exception. -=- Olivier 2011/6/10 jonasr > > Hello, > i have the following problem, the following code doesnt work > > def f1(x): return self.lgdata[2*i][0]*float(x)+self.lgdata[2*i][1] > def f2(x): return self.lgdata[2*i+1][0]*float(x)+self.lgdata[2*i+1][1] > f1=f1(self.ptdata[i]) > 2=f2(self.ptdata[i]) > t=abs(f1-f2) > deltat.append(t) > temptot.append(f1) > rezipr.append(1/t) > xul , xur = self.ptdata[i]-0.001, self.ptdata[i]+0.001 > print xul,xur,f1,f2,"test \n" > verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] > > it gives me the following error: > > Traceback (most recent call last): > File > "/usr/lib/python2.7/site-packages/matplotlib/backends/backend_gtk.py", line > 265, in key_press_event > FigureCanvasBase.key_press_event(self, key, guiEvent=event) > File "/usr/lib/python2.7/site-packages/matplotlib/backend_bases.py", line > 1523, in key_press_event > self.callbacks.process(s, event) > File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 265, in > process > proxy(*args, **kwargs) > File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 191, in > __call__ > return mtd(*args, **kwargs) > File "auswertung.py", line 103, in __call__ > verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] > TypeError: 'numpy.float64' object is not callable > > have no idea where the problem is, since ptdata is an 1dim array there > should be no problem to > pass it to a function ? > > > > -- > View this message in context: > http://old.nabble.com/numpy-function-error-tp31817481p31817481.html > Sent from the Numpy-discussion mailing list archive at Nabble.com. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonas.ruebsam at web.de Fri Jun 10 09:05:21 2011 From: jonas.ruebsam at web.de (jonasr) Date: Fri, 10 Jun 2011 06:05:21 -0700 (PDT) Subject: [Numpy-discussion] numpy function error In-Reply-To: References: Message-ID: <31817786.post@talk.nabble.com> ooh , my bad your right didnt see that thank you Olivier Delalleau-2 wrote: > > You are overriding your f1 function with a float (with > "f1=f1(self.ptdata[i])"), so trying to call f1(xul) later will raise this > exception. 
> > -=- Olivier > > 2011/6/10 jonasr > >> >> Hello, >> i have the following problem, the following code doesnt work >> >> def f1(x): return self.lgdata[2*i][0]*float(x)+self.lgdata[2*i][1] >> def f2(x): return self.lgdata[2*i+1][0]*float(x)+self.lgdata[2*i+1][1] >> f1=f1(self.ptdata[i]) >> 2=f2(self.ptdata[i]) >> t=abs(f1-f2) >> deltat.append(t) >> temptot.append(f1) >> rezipr.append(1/t) >> xul , xur = self.ptdata[i]-0.001, self.ptdata[i]+0.001 >> print xul,xur,f1,f2,"test \n" >> verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] >> >> it gives me the following error: >> >> Traceback (most recent call last): >> File >> "/usr/lib/python2.7/site-packages/matplotlib/backends/backend_gtk.py", >> line >> 265, in key_press_event >> FigureCanvasBase.key_press_event(self, key, guiEvent=event) >> File "/usr/lib/python2.7/site-packages/matplotlib/backend_bases.py", >> line >> 1523, in key_press_event >> self.callbacks.process(s, event) >> File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 265, >> in >> process >> proxy(*args, **kwargs) >> File "/usr/lib/python2.7/site-packages/matplotlib/cbook.py", line 191, >> in >> __call__ >> return mtd(*args, **kwargs) >> File "auswertung.py", line 103, in __call__ >> verts =[(xul,f1(xul)),(xul,f2(xul)),(xur,f2(xur)),(xur,f1(xur))] >> TypeError: 'numpy.float64' object is not callable >> >> have no idea where the problem is, since ptdata is an 1dim array there >> should be no problem to >> pass it to a function ? >> >> >> >> -- >> View this message in context: >> http://old.nabble.com/numpy-function-error-tp31817481p31817481.html >> Sent from the Numpy-discussion mailing list archive at Nabble.com. >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- View this message in context: http://old.nabble.com/Re%3A-numpy-function-error-tp31817652p31817786.html Sent from the Numpy-discussion mailing list archive at Nabble.com. From bbelson at princeton.edu Fri Jun 10 09:14:52 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Fri, 10 Jun 2011 09:14:52 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication Message-ID: Hi, Thanks for getting back to me. I'm doing element wise multiplication, basically innerProduct = numpy.sum(array1*array2) where array1 and array2 are, in general, multidimensional. I need to do many of these operations, and I'd like to split up the tasks between the different cores. I'm not using numpy.dot, if I'm not mistaken I don't think that would do what I need. Thanks again, Brandt Message: 1 > Date: Thu, 09 Jun 2011 13:11:40 -0700 > From: Christopher Barker > Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) > with numpy array multiplication > To: Discussion of Numerical Python > Message-ID: <4DF128FC.8000807 at noaa.gov> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Not much time, here, but since you got no replies earlier: > > > > > I'm parallelizing some code I've written using the built in > > multiprocessing > > > module. In my application, I need to multiply many large arrays > > together > > is the matrix multiplication, or element-wise? 
If matrix, then numpy > should be using LAPACK, which, depending on how its built, could be > using all your cores already. This is heavily dependent on your your > numpy (really the LAPACK it uses0 is built. > > > > and > > > sum the resulting product arrays (inner products). > > are you using numpy.dot() for that? If so, then the above applies to > that as well. > > I know I could look at your code to answer these questions, but I > thought this might help. > > -Chris > > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Fri Jun 10 09:23:10 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 09:23:10 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: It may not work for you depending on your specific problem constraints, but if you could flatten the arrays, then it would be a dot, and you could maybe compute multiple such dot products by storing those flattened arrays into a matrix. -=- Olivier 2011/6/10 Brandt Belson > Hi, > Thanks for getting back to me. > I'm doing element wise multiplication, basically innerProduct = > numpy.sum(array1*array2) where array1 and array2 are, in general, > multidimensional. I need to do many of these operations, and I'd like to > split up the tasks between the different cores. I'm not using numpy.dot, if > I'm not mistaken I don't think that would do what I need. > Thanks again, > Brandt > > > Message: 1 >> Date: Thu, 09 Jun 2011 13:11:40 -0700 >> From: Christopher Barker >> Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) >> with numpy array multiplication >> To: Discussion of Numerical Python >> Message-ID: <4DF128FC.8000807 at noaa.gov> >> Content-Type: text/plain; charset=ISO-8859-1; format=flowed >> >> Not much time, here, but since you got no replies earlier: >> >> >> > > I'm parallelizing some code I've written using the built in >> > multiprocessing >> > > module. In my application, I need to multiply many large arrays >> > together >> >> is the matrix multiplication, or element-wise? If matrix, then numpy >> should be using LAPACK, which, depending on how its built, could be >> using all your cores already. This is heavily dependent on your your >> numpy (really the LAPACK it uses0 is built. >> >> > > and >> > > sum the resulting product arrays (inner products). >> >> are you using numpy.dot() for that? If so, then the above applies to >> that as well. >> >> I know I could look at your code to answer these questions, but I >> thought this might help. >> >> -Chris >> >> >> >> >> >> -- >> Christopher Barker, Ph.D. >> Oceanographer >> >> Emergency Response Division >> NOAA/NOS/OR&R (206) 526-6959 voice >> 7600 Sand Point Way NE (206) 526-6329 fax >> Seattle, WA 98115 (206) 526-6317 main reception >> >> Chris.Barker at noaa.gov >> >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
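
A small illustration of the flattening suggestion above, with made-up array shapes (none of this is from the original code): summing an element-wise product equals a dot product of the raveled arrays, so a batch of inner products against one common array can be computed as a single matrix-vector product.

import numpy as np

np.random.seed(0)
a = np.random.rand(4, 5, 6)
bs = [np.random.rand(4, 5, 6) for _ in range(10)]

direct = np.array([np.sum(a * b) for b in bs])   # one inner product per pair
stacked = np.array([b.ravel() for b in bs])      # shape (10, 120)
batched = np.dot(stacked, a.ravel())             # all ten at once

print np.allclose(direct, batched)               # True
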
URL: From mwwiebe at gmail.com Fri Jun 10 10:18:31 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 09:18:31 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Fri, Jun 10, 2011 at 12:56 AM, Ralf Gommers wrote: > > > On Fri, Jun 10, 2011 at 1:54 AM, Mark Wiebe wrote: > >> On Thu, Jun 9, 2011 at 5:21 PM, Ralf Gommers > > wrote: >> >>> >>> >>> On Thu, Jun 9, 2011 at 11:54 PM, Mark Wiebe wrote: >>> >>>> On Thu, Jun 9, 2011 at 4:27 PM, Ralf Gommers < >>>> ralf.gommers at googlemail.com> wrote: >>>> >>>>> >>>>> >>>>> On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe wrote: >>>>> >>>>>> On Thu, Jun 9, 2011 at 3:41 PM, Christopher Barker < >>>>>> Chris.Barker at noaa.gov> wrote: >>>>>> >>>>>> Your branch works fine for me (OS X, py2.6), no failures. Only a few >>>>> deprecation warnings like: >>>>> /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: >>>>> DeprecationWarning: DType strings 'O4' and 'O8' are deprecated because they >>>>> are platform specific. Use 'O' instead >>>>> callableObj(*args, **kwargs) >>>>> >>>> >>>> It looks like there are some '|O4' dtypes in 'lib/tests/test_format.py', >>>> testing the .npy file format. I'm not sure why I'm not getting this warning >>>> though. >>>> >>>> >>>>> Mark Wiebe wrote: >>>>>>> > Because of the nature of datetime and timedelta, arange has to be >>>>>>> > slightly different than with all the other types. In particular, >>>>>>> for >>>>>>> > datetime the primary signature is np.arange(datetime, datetime, >>>>>>> timedelta). >>>>>>> > >>>>>>> > I've implemented a simple extension which allows for another way to >>>>>>> > specify a date range, as np.arange(datetime, timedelta, timedelta). >>>>>>> >>>>>>> Did you think about how to document which of these basic functions >>>>> work with datetime? I don't think that belongs in the docstrings, but it may >>>>> then be hard for the user to figure out which functions accept datetimes. >>>>> And there will be no usage examples in the docstrings. >>>>> >>>> >>>> I think documenting it in a 'datetime' section of the arange >>>> documentation would be reasonable. The main datetime documentation page >>>> would also mention the functions that are most useful. >>>> >>>> Besides docs, I am not sure about your choice to modify functions like >>>>> arange instead of writing a module of wrapper functions for them that know >>>>> what to do with the dtype. If you have a module you can group all relevant >>>>> functions, so they're easy to find. Plus it's more future-proof - if at some >>>>> point numpy grows another new dtype, just create a new module with wrapper >>>>> funcs for that dtype. >>>>> >>>> >>>> The facts that datetime and timedelta are related in a particular way >>>> different from other data types, and that they are parameterized types, both >>>> contribute to them not fitting naturally the current structure of NumPy. I'm >>>> not sure I understand the module idea, >>>> >>> >>> Basically, use np.datetime.arange which understand the dtype, then calls >>> np.arange under the hood. Or is just its own function, like the dtrange() >>> Robert just suggested. It's pretty much the same as for the ma module, which >>> reimplements or wraps many numpy functions that do not understand masked >>> arrays. >>> >> >> I'm not a big fan of the way the ma module works, it doesn't integrate >> naturally and orthogonally with all the other features of NumPy. 
It's also >> an array subtype, quite different from a dtype. We don't have >> np.bool.arange, np.int8.arange, etc, and the abstraction used by arange >> built into the custom data type mechanism is too weak too support the needs >> of datetime. >> >> > I'd like to use the requirements of datetime as a guide to molding the >> future design of the data type system, and if we make datetime a >> second-class citizen because it doesn't behave like a float, we're not going >> to be able to discover the possibilities. >> >> > I would rather think that since it's a built-in NumPy data type, it >>>> should work with the regular NumPy functions wherever that makes sense. >>>> >>> >>> That doesn't make sense to me. Being a dtype that happens to be shipped >>> with numpy doesn't make it more special than other dtypes. >>> >> >> This isn't making it more special, it's just conforming the natural NumPy >> way to how datetime/timedelta operates. >> > > Maybe I'm misunderstanding this, and once you make a function work for > datetime it would also work for other new dtypes. But my impression is that > that's not the case. Let's say I make a new dtype with distance instead of > time attached. Would I be able to use it with arange, or would I have to go > in and change the arange implementation again to support it? > Ok, I think I understand the point you're driving at now. We need NumPy to have the flexibility so that external plugins defining custom data types can do the same thing that datetime does, having datetime be special compared to those is undesirable. This I wholeheartedly agree with, and the way I'm coding datetime is driving in that direction. The state of the NumPy codebase, however, prevents jumping straight to such a solution, since there are several mechanisms and layers of such abstraction already which themselves do not satisfy the needs of datetime and other similar types. Also, arange is just one of a large number of functions which could be extended individually for different types, for example if one wanted to make a unit quaternion data type for manipulating rotations, its needs would be significantly different. Because I can't see the big picture from within the world of datetime, and because such generalization takes a great amount of effort, I'm instead making these changes minimally invasive on the current codebase. This approach is along the lines of "lifting" in generic programming. First you write your algorithm in the specific domain, and work out how it behaves there. Then, you determine what are the minimal requirements of the types involved, and abstract the algorithm appropriately. Jumping to the second step directly is generally too difficult, an incremental path to the final goal must be used. http://www.generic-programming.org/about/intro/lifting.php Cheers, Mark > > Ralf > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbelson at princeton.edu Fri Jun 10 11:01:48 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Fri, 10 Jun 2011 11:01:48 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication Message-ID: Unfortunately I can't flatten the arrays. 
I'm writing a library where the user supplies an inner product function for two generic objects, and almost always the inner product function does large array multiplications at some point. The library doesn't get to know about the underlying arrays. Thanks, Brandt > Message: 2 > Date: Fri, 10 Jun 2011 09:23:10 -0400 > From: Olivier Delalleau > Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) > with numpy array multiplication > To: Discussion of Numerical Python > Message-ID: > Content-Type: text/plain; charset="iso-8859-1" > > It may not work for you depending on your specific problem constraints, but > if you could flatten the arrays, then it would be a dot, and you could > maybe > compute multiple such dot products by storing those flattened arrays into a > matrix. > > -=- Olivier > > 2011/6/10 Brandt Belson > > > Hi, > > Thanks for getting back to me. > > I'm doing element wise multiplication, basically innerProduct = > > numpy.sum(array1*array2) where array1 and array2 are, in general, > > multidimensional. I need to do many of these operations, and I'd like to > > split up the tasks between the different cores. I'm not using numpy.dot, > if > > I'm not mistaken I don't think that would do what I need. > > Thanks again, > > Brandt > > > > > > Message: 1 > >> Date: Thu, 09 Jun 2011 13:11:40 -0700 > >> From: Christopher Barker > >> Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) > >> with numpy array multiplication > >> To: Discussion of Numerical Python > >> Message-ID: <4DF128FC.8000807 at noaa.gov> > >> Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >> > >> Not much time, here, but since you got no replies earlier: > >> > >> > >> > > I'm parallelizing some code I've written using the built in > >> > multiprocessing > >> > > module. In my application, I need to multiply many large arrays > >> > together > >> > >> is the matrix multiplication, or element-wise? If matrix, then numpy > >> should be using LAPACK, which, depending on how its built, could be > >> using all your cores already. This is heavily dependent on your your > >> numpy (really the LAPACK it uses0 is built. > >> > >> > > and > >> > > sum the resulting product arrays (inner products). > >> > >> are you using numpy.dot() for that? If so, then the above applies to > >> that as well. > >> > >> I know I could look at your code to answer these questions, but I > >> thought this might help. > >> > >> -Chris > >> > >> > >> > >> > >> > >> -- > >> Christopher Barker, Ph.D. > >> Oceanographer > >> > >> Emergency Response Division > >> NOAA/NOS/OR&R (206) 526-6959 voice > >> 7600 Sand Point Way NE (206) 526-6329 fax > >> Seattle, WA 98115 (206) 526-6317 main reception > >> > >> Chris.Barker at noaa.gov > >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsouthey at gmail.com Fri Jun 10 11:03:16 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 10 Jun 2011 10:03:16 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: <4DF23234.5030508@gmail.com> On 06/10/2011 09:18 AM, Mark Wiebe wrote: > On Fri, Jun 10, 2011 at 12:56 AM, Ralf Gommers > > wrote: > > > > On Fri, Jun 10, 2011 at 1:54 AM, Mark Wiebe > wrote: > > On Thu, Jun 9, 2011 at 5:21 PM, Ralf Gommers > > wrote: > > > > On Thu, Jun 9, 2011 at 11:54 PM, Mark Wiebe > > wrote: > > On Thu, Jun 9, 2011 at 4:27 PM, Ralf Gommers > > wrote: > > > > On Thu, Jun 9, 2011 at 10:58 PM, Mark Wiebe > > wrote: > > On Thu, Jun 9, 2011 at 3:41 PM, Christopher > Barker > wrote: > > Your branch works fine for me (OS X, py2.6), no > failures. Only a few deprecation warnings like: > /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py:336: > DeprecationWarning: DType strings 'O4' and 'O8' > are deprecated because they are platform specific. > Use 'O' instead > callableObj(*args, **kwargs) > > > It looks like there are some '|O4' dtypes in > 'lib/tests/test_format.py', testing the .npy file > format. I'm not sure why I'm not getting this warning > though. > > Mark Wiebe wrote: > > Because of the nature of datetime and > timedelta, arange has to be > > slightly different than with all the > other types. In particular, for > > datetime the primary signature is > np.arange(datetime, datetime, timedelta). > > > > I've implemented a simple extension > which allows for another way to > > specify a date range, as > np.arange(datetime, timedelta, timedelta). > > Did you think about how to document which of these > basic functions work with datetime? I don't think > that belongs in the docstrings, but it may then be > hard for the user to figure out which functions > accept datetimes. And there will be no usage > examples in the docstrings. > > > I think documenting it in a 'datetime' section of the > arange documentation would be reasonable. The main > datetime documentation page would also mention the > functions that are most useful. > > Besides docs, I am not sure about your choice to > modify functions like arange instead of writing a > module of wrapper functions for them that know > what to do with the dtype. If you have a module > you can group all relevant functions, so they're > easy to find. Plus it's more future-proof - if at > some point numpy grows another new dtype, just > create a new module with wrapper funcs for that dtype. > > > The facts that datetime and timedelta are related in a > particular way different from other data types, and > that they are parameterized types, both contribute to > them not fitting naturally the current structure of > NumPy. I'm not sure I understand the module idea, > > > Basically, use np.datetime.arange which understand the > dtype, then calls np.arange under the hood. Or is just its > own function, like the dtrange() Robert just suggested. > It's pretty much the same as for the ma module, which > reimplements or wraps many numpy functions that do not > understand masked arrays. > > > I'm not a big fan of the way the ma module works, it doesn't > integrate naturally and orthogonally with all the other > features of NumPy. It's also an array subtype, quite different > from a dtype. 
We don't have np.bool.arange, np.int8.arange, > etc, and the abstraction used by arange built into the custom > data type mechanism is too weak too support the needs of datetime. > > I'd like to use the requirements of datetime as a guide > to molding the future design of the data type system, and if > we make datetime a second-class citizen because it doesn't > behave like a float, we're not going to be able to discover > the possibilities. > > I would rather think that since it's a built-in NumPy > data type, it should work with the regular NumPy > functions wherever that makes sense. > > > That doesn't make sense to me. Being a dtype that happens > to be shipped with numpy doesn't make it more special than > other dtypes. > > > This isn't making it more special, it's just conforming the > natural NumPy way to how datetime/timedelta operates. > > Maybe I'm misunderstanding this, and once you make a function work > for datetime it would also work for other new dtypes. But my > impression is that that's not the case. Let's say I make a new > dtype with distance instead of time attached. Would I be able to > use it with arange, or would I have to go in and change the arange > implementation again to support it? > > > Ok, I think I understand the point you're driving at now. We need > NumPy to have the flexibility so that external plugins defining custom > data types can do the same thing that datetime does, having datetime > be special compared to those is undesirable. This I wholeheartedly > agree with, and the way I'm coding datetime is driving in that direction. > > The state of the NumPy codebase, however, prevents jumping straight to > such a solution, since there are several mechanisms and layers of such > abstraction already which themselves do not satisfy the needs of > datetime and other similar types. Also, arange is just one of a large > number of functions which could be extended individually for different > types, for example if one wanted to make a unit quaternion data type > for manipulating rotations, its needs would be significantly > different. Because I can't see the big picture from within the world > of datetime, and because such generalization takes a great amount of > effort, I'm instead making these changes minimally invasive on the > current codebase. > > This approach is along the lines of "lifting" in generic programming. > First you write your algorithm in the specific domain, and work out > how it behaves there. Then, you determine what are the minimal > requirements of the types involved, and abstract the algorithm > appropriately. Jumping to the second step directly is generally too > difficult, an incremental path to the final goal must be used. > > http://www.generic-programming.org/about/intro/lifting.php > > Cheers, > Mark > > > Ralf > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion I have following the multiple date/time discussions with some interest as it is clear there is not 'one way' (perhaps it's Dutch). But, I do keep coming back to Chris's concepts of time as a strict unit of measure and time as a calender. 
So I do think that types of changes are rather premature without defining a some base measurement of time - probably some thing like Unix time or International Atomic Time (TAI) but not UTC due to leap seconds (http://en.wikipedia.org/wiki/Leap_second). Leap seconds make using UTC rather problematic for a couple of reasons: 1) It's essentially only historical. A range of the seconds in December 2011 computed 'now' in June 2011 using UTC might be different than a range calculated in a couple weeks if leaps seconds are added to December 2011. 2) There is also the issue that 23:59:60 December 31, 2008 UTC is a valid time but not for other years like 2009 and 2010. It also means that you have to be careful of doing experiments that require accuracy of seconds or less because a 1 second gap could be recorded as a 2 second gap. The other issue is how do you define the np.arange step argument since that can be in different scales such as month, years, seconds? Can a user specific days and get half-days (like 1.5 days) or must these be 'integer' days? Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwgoodman at gmail.com Fri Jun 10 12:18:31 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Fri, 10 Jun 2011 09:18:31 -0700 Subject: [Numpy-discussion] Returning the same dtype in Cython as np.argmax In-Reply-To: <915111FA-820F-4E69-A622-7BD3B2F99191@enthought.com> References: <915111FA-820F-4E69-A622-7BD3B2F99191@enthought.com> Message-ID: On Wed, Jun 8, 2011 at 9:49 PM, Travis Oliphant wrote: > On Jun 7, 2011, at 3:17 PM, Keith Goodman wrote: >> What is the rule to determine the dtype returned by numpy functions >> that return indices such as np.argmax? > > The return type of indices will be np.intp. Thanks, Travis. I now know a little more about NumPy. > I don't know if Cython provides that type or not. ? If not, then it is either np.int32 or np.int64 depending on the system. This part was answered on the cython list: http://groups.google.com/group/cython-users/browse_thread/thread/f8022ee7ccbf7c5b Short answer is that it is not part of cython 0.14.1 but it is easy to add. From kbasye1 at jhu.edu Fri Jun 10 12:40:55 2011 From: kbasye1 at jhu.edu (Ken Basye) Date: Fri, 10 Jun 2011 12:40:55 -0400 Subject: [Numpy-discussion] Object array from list in 1.6.0 (vs. 1.5.1) In-Reply-To: References: Message-ID: <4DF24917.5060703@jhu.edu> Dear folks, I have some code that stopped working with 1.6.0 and I'm wondering if there's a better way to replace it than what I came up with. Here's a condensed version: x = [()] # list containing an empty tuple; this isn't the only case, but it's one that must be handled correctly y = np.empty(len(x), dtype=object) y[:] = x[:] In 1.5.1 this works, giving y as "array([()], dtype=object)" In 1.6.0, it raises a ValueError: ValueError: output operand requires a reduction, but reduction is not enabled I didn't see anything in the release notes about this; admittedly it's a very small corner case. I also don't understand what the error message is trying to tell me here. Most of the more straightforward ways to construct the desired array don't work because the interpretation of a nested structure is as a multi-dimensional array, which is reasonable. At this point my workaround is to replace the assignment with a loop: for i, v in enumerate(x): y[i] = v but this seems non-Numpyish. Any light on what the error means or a better way to do this would be most welcome. 
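
For reference, a self-contained version of the construction and the loop workaround described above (the ValueError behavior is as reported for 1.6.0; under 1.5.1 the sliced assignment y[:] = x[:] works directly):

import numpy as np

x = [()]                              # list containing an empty tuple
y = np.empty(len(x), dtype=object)
for i, v in enumerate(x):             # element-by-element assignment
    y[i] = v
print repr(y)                         # array([()], dtype=object)
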
Thanks, Ken From ben.root at ou.edu Fri Jun 10 15:50:32 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 14:50:32 -0500 Subject: [Numpy-discussion] numpy type mismatch Message-ID: Came across an odd error while using numpy master. Note, my system is 32-bits. >>> import numpy as np >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 False >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 True >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 True >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 True So, only the summation performed with a np.int32 accumulator results in a type that doesn't match the expected type. Now, for even more strangeness: >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) '0x9599a0' >>> hex(id(np.int32)) '0x959a80' So, the type from the sum() reports itself as a numpy int, but its memory address is different from the memory address for np.int32. Weirdness... Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 10 16:02:21 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 10 Jun 2011 14:02:21 -0600 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: > Came across an odd error while using numpy master. Note, my system is > 32-bits. > > >>> import numpy as np > >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 > False > >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 > True > >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 > True > >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 > True > > So, only the summation performed with a np.int32 accumulator results in a > type that doesn't match the expected type. Now, for even more strangeness: > > >>> type(np.sum([1, 2, 3], dtype=np.int32)) > > >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) > '0x9599a0' > >>> hex(id(np.int32)) > '0x959a80' > > So, the type from the sum() reports itself as a numpy int, but its memory > address is different from the memory address for np.int32. > > One of them is probably a long, print out the typecode, dtype.char. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Fri Jun 10 16:08:19 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 16:08:19 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: It's ok to have two different dtypes (in the sense that "d1 is not d2") such that they represent the same kind of data (in the sense that d1 == d2). However I think your very first test should have returned True (for what it's worth, it returns true with 1.5.1 on Windows 32 bit). -=- Olivier 2011/6/10 Benjamin Root > Came across an odd error while using numpy master. Note, my system is > 32-bits. > > >>> import numpy as np > >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 > False > >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 > True > >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 > True > >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 > True > > So, only the summation performed with a np.int32 accumulator results in a > type that doesn't match the expected type. 
Now, for even more strangeness: > > >>> type(np.sum([1, 2, 3], dtype=np.int32)) > > >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) > '0x9599a0' > >>> hex(id(np.int32)) > '0x959a80' > > So, the type from the sum() reports itself as a numpy int, but its memory > address is different from the memory address for np.int32. > > Weirdness... > Ben Root > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 10 16:17:03 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 15:17:03 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris wrote: > > > On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: > >> Came across an odd error while using numpy master. Note, my system is >> 32-bits. >> >> >>> import numpy as np >> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >> False >> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >> True >> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >> True >> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >> True >> >> So, only the summation performed with a np.int32 accumulator results in a >> type that doesn't match the expected type. Now, for even more strangeness: >> >> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >> >> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >> '0x9599a0' >> >>> hex(id(np.int32)) >> '0x959a80' >> >> So, the type from the sum() reports itself as a numpy int, but its memory >> address is different from the memory address for np.int32. >> >> > One of them is probably a long, print out the typecode, dtype.char. > > Chuck > > > Good intuition, but odd result... >>> import numpy as np >>> a = np.sum([1, 2, 3], dtype=np.int32) >>> b = np.int32(6) >>> type(a) >>> type(b) >>> a.dtype.char 'i' >>> b.dtype.char 'l' So, the standard np.int32 is getting listed as a long somehow? To further investigate: >>> a.dtype.itemsize 4 >>> b.dtype.itemsize 4 So, at least the sizes are right. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Fri Jun 10 16:24:36 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 16:24:36 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: 2011/6/10 Benjamin Root > > > On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >> >>> Came across an odd error while using numpy master. Note, my system is >>> 32-bits. >>> >>> >>> import numpy as np >>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>> False >>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>> True >>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>> True >>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>> True >>> >>> So, only the summation performed with a np.int32 accumulator results in a >>> type that doesn't match the expected type. 
Now, for even more strangeness: >>> >>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>> >>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>> '0x9599a0' >>> >>> hex(id(np.int32)) >>> '0x959a80' >>> >>> So, the type from the sum() reports itself as a numpy int, but its memory >>> address is different from the memory address for np.int32. >>> >>> >> One of them is probably a long, print out the typecode, dtype.char. >> >> Chuck >> >> >> > Good intuition, but odd result... > > >>> import numpy as np > >>> a = np.sum([1, 2, 3], dtype=np.int32) > >>> b = np.int32(6) > >>> type(a) > > >>> type(b) > > >>> a.dtype.char > 'i' > >>> b.dtype.char > 'l' > > So, the standard np.int32 is getting listed as a long somehow? To further > investigate: > > >>> a.dtype.itemsize > 4 > >>> b.dtype.itemsize > 4 > > So, at least the sizes are right. > > Ben Root > long on a 32 bit computer is indeed int32. I think your issue is that in your version of numpy, numpy.dtype('i') != numpy.dtype('l') (while they are equal e.g. in Numpy 1.5.1). -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 10 16:24:44 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 15:24:44 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Friday, June 10, 2011, Olivier Delalleau wrote: > It's ok to have two different dtypes (in the sense that "d1 is not d2") such that they represent the same kind of data (in the sense that d1 == d2). Note that the memory addresses for int64, float32 and float64 accumulators did match the expected type. This issue with int32 seems to be unique. Ben Root From charlesr.harris at gmail.com Fri Jun 10 16:24:57 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 10 Jun 2011 14:24:57 -0600 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: > > > On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >> >>> Came across an odd error while using numpy master. Note, my system is >>> 32-bits. >>> >>> >>> import numpy as np >>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>> False >>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>> True >>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>> True >>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>> True >>> >>> So, only the summation performed with a np.int32 accumulator results in a >>> type that doesn't match the expected type. Now, for even more strangeness: >>> >>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>> >>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>> '0x9599a0' >>> >>> hex(id(np.int32)) >>> '0x959a80' >>> >>> So, the type from the sum() reports itself as a numpy int, but its memory >>> address is different from the memory address for np.int32. >>> >>> >> One of them is probably a long, print out the typecode, dtype.char. >> >> Chuck >> >> >> > Good intuition, but odd result... > > >>> import numpy as np > >>> a = np.sum([1, 2, 3], dtype=np.int32) > >>> b = np.int32(6) > >>> type(a) > > >>> type(b) > > >>> a.dtype.char > 'i' > >>> b.dtype.char > 'l' > > So, the standard np.int32 is getting listed as a long somehow? To further > investigate: > > Yes, long shifts around from int32 to int64 depending on the OS. 
For instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. On 32 bit systems it is 32 bits. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 10 17:43:01 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 16:43:01 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris wrote: > > > On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: > >> >> >> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>> >>>> Came across an odd error while using numpy master. Note, my system is >>>> 32-bits. >>>> >>>> >>> import numpy as np >>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>> False >>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>> True >>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>> True >>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>> True >>>> >>>> So, only the summation performed with a np.int32 accumulator results in >>>> a type that doesn't match the expected type. Now, for even more >>>> strangeness: >>>> >>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>> >>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>> '0x9599a0' >>>> >>> hex(id(np.int32)) >>>> '0x959a80' >>>> >>>> So, the type from the sum() reports itself as a numpy int, but its >>>> memory address is different from the memory address for np.int32. >>>> >>>> >>> One of them is probably a long, print out the typecode, dtype.char. >>> >>> Chuck >>> >>> >>> >> Good intuition, but odd result... >> >> >>> import numpy as np >> >>> a = np.sum([1, 2, 3], dtype=np.int32) >> >>> b = np.int32(6) >> >>> type(a) >> >> >>> type(b) >> >> >>> a.dtype.char >> 'i' >> >>> b.dtype.char >> 'l' >> >> So, the standard np.int32 is getting listed as a long somehow? To further >> investigate: >> >> > Yes, long shifts around from int32 to int64 depending on the OS. For > instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. > On 32 bit systems it is 32 bits. > > Chuck > > Right, that makes sense. But, the question is why does sum() put out a result dtype that is not identical to the dtype that I requested, or even the dtype of the input array? Could this be an indication of a bug somewhere? Even if the bug is harmless (it was only noticed within the test suite of larry), is this unexpected? Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 10 18:49:18 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 17:49:18 -0500 Subject: [Numpy-discussion] datetime business day API In-Reply-To: References: Message-ID: I've implemented the busday_offset function with support for the weekmask and roll parameters, the commits are tagged 'datetime-bday' in the pull request here: https://github.com/numpy/numpy/pull/87 -Mark On Thu, Jun 9, 2011 at 5:23 PM, Mark Wiebe wrote: > Here's a possible design for a business day API for numpy datetimes: > > > The 'B' business day unit will be removed. All business day-related > calculations will be done using the 'D' day unit. > > A class *BusinessDayDef* to encapsulate the definition of the business > week and holidays. 
The business day functions will either take one of these > objects, or separate weekmask and holidays parameters, to specify the > business day definition. This class serves as both a performance > optimization and a way to encapsulate the weekmask and holidays together, > for example if you want to make a dictionary mapping exchange names to their > trading days definition. > > The weekmask can be specified in a number of ways, and internally becomes a > boolean array with 7 elements with True for the days Monday through Sunday > which are valid business days. Some different notations are for the 5-day > week include [1,1,1,1,1,0,0], "1111100" "MonTueWedThuFri". The holidays are > always specified as a one-dimensional array of dtype 'M8[D]', and are > internally used in sorted form. > > > A function *is_busday*(datearray, weekmask=, holidays=, busdaydef=) > returns a boolean array matching the input datearray, with True for the > valid business days. > > A function *busday_offset*(datearray, offsetarray, > roll='raise', weekmask=, holidays=, busdaydef=) which first applies the > 'roll' policy to start at a valid business date, then offsets the date by > the number of business days specified in offsetarray. The arrays datearray > and offsetarray are broadcast together. The 'roll' parameter can be > 'forward'/'following', 'backward'/'preceding', 'modifiedfollowing', > 'modifiedpreceding', or 'raise' (the default). > > A function *busday_count*(datearray1, datearray2, weekmask=, holidays=, > busdaydef=) which calculates the number of business days between datearray1 > and datearray2, not including the day of datearray2. > > > For example, to find the first Monday in Feb 2011, > > >>>np.busday_offset('2011-02', 0, roll='forward', weekmask='Mon') > > or to find the number of weekdays in Feb 2011, > > >>>np.busday_count('2011-02', '2011-03') > > This set of three functions appears to be powerful enough to express the > business-day computations that I've been shown thus far. > > Cheers, > Mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 10 18:57:49 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 17:57:49 -0500 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: <4DF23234.5030508@gmail.com> References: <4DF13004.5090304@noaa.gov> <4DF23234.5030508@gmail.com> Message-ID: On Fri, Jun 10, 2011 at 10:03 AM, Bruce Southey wrote: > ** > > > I have following the multiple date/time discussions with some interest as > it is clear there is not 'one way' (perhaps it's Dutch). But, I do keep > coming back to Chris's concepts of time as a strict unit of measure and time > as a calender. So I do think that types of changes are rather premature > without defining a some base measurement of time - probably some thing like > Unix time or International Atomic Time (TAI) but not UTC due to leap seconds > (http://en.wikipedia.org/wiki/Leap_second). > I think this is an important issue, definitely. I proposed an idea for dealing with it here: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056660.html where I got some feedback that using UTC by default would have enough compatibility issues to change it to POSIX time by default with a TAI mode selectable in the metadata. Leap seconds make using UTC rather problematic for a couple of reasons: > 1) It's essentially only historical. 
A range of the seconds in December > 2011 computed 'now' in June 2011 using UTC might be different than a range > calculated in a couple weeks if leaps seconds are added to December 2011. > 2) There is also the issue that 23:59:60 December 31, 2008 UTC is a valid > time but not for other years like 2009 and 2010. It also means that you have > to be careful of doing experiments that require accuracy of seconds or less > because a 1 second gap could be recorded as a 2 second gap. > With the TAI datetime metadata it should be possible to deal with these problems in a way. By specifying times further than 6 months in the future in TAI, the leap-second issue can be avoided for 1), and leap seconds like 2) and time differences around them will work with TAI. The conversion from TAI datetimes to/from UTC strings would include the :60 for added leap seconds. > The other issue is how do you define the np.arange step argument since that > can be in different scales such as month, years, seconds? Can a user > specific days and get half-days (like 1.5 days) or must these be 'integer' > days? > The datetime and timedelta are purely integers, so half-days are only possible by going to hours. I think this is good for datetime, but for timedelta it would be nicer to have a general physical quantities/units system in NumPy, so timedelta would be a regular integer or float with time as its metadata instead of having a timedelta type. That's a whole lot more work though, and doesn't seem like a short term thing... Cheers, Mark > > Bruce > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 10 18:58:06 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 10 Jun 2011 16:58:06 -0600 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: > > > On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >> >>> >>> >>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>> charlesr.harris at gmail.com> wrote: >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>> >>>>> Came across an odd error while using numpy master. Note, my system is >>>>> 32-bits. >>>>> >>>>> >>> import numpy as np >>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>> False >>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>> True >>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>> True >>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>> True >>>>> >>>>> So, only the summation performed with a np.int32 accumulator results in >>>>> a type that doesn't match the expected type. Now, for even more >>>>> strangeness: >>>>> >>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>> >>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>> '0x9599a0' >>>>> >>> hex(id(np.int32)) >>>>> '0x959a80' >>>>> >>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>> memory address is different from the memory address for np.int32. >>>>> >>>>> >>>> One of them is probably a long, print out the typecode, dtype.char. >>>> >>>> Chuck >>>> >>>> >>>> >>> Good intuition, but odd result... 
>>> >>> >>> import numpy as np >>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>> >>> b = np.int32(6) >>> >>> type(a) >>> >>> >>> type(b) >>> >>> >>> a.dtype.char >>> 'i' >>> >>> b.dtype.char >>> 'l' >>> >>> So, the standard np.int32 is getting listed as a long somehow? To >>> further investigate: >>> >>> >> Yes, long shifts around from int32 to int64 depending on the OS. For >> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >> On 32 bit systems it is 32 bits. >> >> Chuck >> >> > Right, that makes sense. But, the question is why does sum() put out a > result dtype that is not identical to the dtype that I requested, or even > the dtype of the input array? Could this be an indication of a bug > somewhere? Even if the bug is harmless (it was only noticed within the test > suite of larry), is this unexpected? > > I expect sum is using a ufunc and it acts differently on account of the cleanup of the ufunc casting rules. And yes, a long *is* int32 on your machine. On mine In [4]: dtype('q') # long long Out[4]: dtype('int64') In [5]: dtype('l') # long Out[5]: dtype('int64') The mapping from C types to numpy width types isn't 1-1. Personally, I think we should drop long ;) But it used to be the standard Python type in the C API. Mark has also pointed out the problems/confusion this ambiguity causes and someday we should probably think it out and fix it. But I don't think it is the most pressing problem. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 10 19:03:00 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 18:03:00 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <20D9BDDD-1CF4-4B73-BDD5-2D77A7AF4D01@gmail.com> References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <20D9BDDD-1CF4-4B73-BDD5-2D77A7AF4D01@gmail.com> Message-ID: On Thu, Jun 9, 2011 at 1:44 AM, Pierre GM wrote: > > > > The fact that it's a NumPy dtype probably is the biggest limiting factor > preventing parameters like 'start' and 'end' during conversion. Having a > datetime represent an instant in time neatly removes any ambiguity, so > converting between days and seconds as a unit is analogous to converting > between int32 and float32. > > Back to Dave's example, '2011' is not necessarily the same instant in time > as '2011-01'. The conversion from low to high frequencies requires some > reference. In scikits.timeseries terms, we could assume that 'START' is the > reference. No problem with that, if we can have a way to extend that (eg, > through some metadata). I tried something like that in the experimental > version of scikits.timeseries I was babbling about earlier... > I don't think you would want to extend the datetime with more metadata, but rather use it as a tool to create the timeseries with. You could create a lightweight wrapper around datetime arrays which exposed a timeseries-oriented interface but used the datetime functionality to do the computations and provide good performance with large numbers of datetimes. I would love it if you have the time to experiment a bit with the https://github.com/m-paradox/numpy branch I'm developing on, so I could tweak the design or add features to make this easier based on your feedback from using the new data type. 
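
A very rough sketch of the wrapper idea, for illustration only: the class and method names here are hypothetical and not part of NumPy or scikits.timeseries, and it assumes the datetime64 array support being developed on that branch.

import numpy as np

class SimpleSeries(object):
    """Toy container: a datetime64 date array kept next to a value array."""
    def __init__(self, dates, values):
        self.dates = np.asarray(dates, dtype='M8[D]')
        self.values = np.asarray(values)

    def since(self, date):
        # Boolean indexing on the datetime64 array selects a sub-series,
        # so the date arithmetic is delegated entirely to NumPy.
        mask = self.dates >= np.datetime64(date)
        return SimpleSeries(self.dates[mask], self.values[mask])

s = SimpleSeries(['2011-06-08', '2011-06-09', '2011-06-10'], [1.0, 2.0, 3.0])
print s.since('2011-06-09').values    # [ 2.  3.]
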
Cheers, Mark > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 10 19:16:54 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 18:16:54 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <4DF110AC.5050903@noaa.gov> References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DF110AC.5050903@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 1:27 PM, Christopher Barker wrote: > Mark Wiebe wrote: > > Because datetime64 is a NumPy data type, it needs a well-defined rule > > for these kinds of conversions. Treating datetimes as moments in time > > instead of time intervals makes a very nice rule which appears to be > > very amenable to a variety of computations, which is why I like the > > approach. > > This really is the key issue that I've been harping on (sorry...) this > whole thread: > > For many uses, a datetime as a moment in time is a great abstraction, > and I think how most datetime implementations (like the std lib one) are > used. > > However, when you are trying to represent/work with data like monthly > averages and the like, you need something that represents something else > -- and trying to use the same mechanism as for time instants, and hoping > the the ambiguities will resolve themselves from the context is dangerous. > I'm trying to have clear, unambiguous meanings for all the operations on datetime types. There are issues that make things complicated, but I think they arise from the nature of dates and times. I don't work in finance, so I'm not sure about things like b-monthly > payments -- it seems those could well be defined as instances -- the > payments are due on a given day each month( say the 1st and 15th), and, > I assume that is well defined to the instant -- i.e. before the end of > the day in some time zone. (note that that would be hour time: > 23:59.99999, rather than, zero, however). The trick with these comes in > when you do math -- the timedelta issue -- what is a 1 month timedelta? > It's NOT an given number of days, hours, etc. > For datetimes, converting between months/years representation and days/hours/etc representations is well defined, but a non-linear relationship. For timedeltas, I've flagged such conversions as 'unsafe' according to the can_cast function, and this is related to my desire to tighten up NumPy's casting rules in general. I view this issue as the same as silent conversion of 3.72 to 3 if you assign to an integer array, and would like the thing that stops you from easily converting a month timedelta to a day timedelta be the same. When the cast is forced, it currently uses the average over 400 years (the leap-year period). I don't know that anyone has time to do this, but it seems a written up > set of use-cases would help focus this conversation -- I know I've > pretty lost of what uses we are trying to support. > > another question: > > can you instantiate a datetime64 with something other than a string? > i.e. a (year, [month], [day], [hour], [second], [usecond]) tuple? > The Python datetime.datetime, datetime,date, and datetime.timedelta objects convert to datetime64 and timedelta64. I can add more construction methods, but would like to get more people building the branch and trying it before doing that. 
Some feedback from user experience would be helpful to suggest what's necessary. > > The fact that it's a NumPy dtype probably is the biggest limiting > > factor preventing parameters like 'start' and 'end' during conversion. > > Having a datetime represent an instant in time neatly removes any > > ambiguity, so converting between days and seconds as a unit is > > analogous to converting between int32 and float32. > > Sure, but I don't know that that is the best way to go -- integers are > precisely defined and generally used as 3 == 3.00000000 That's not the > case for months, at least if it's supposed be be a monthly average-type > representation. > I don't think a monthly average-type representation is desirable for the the most primitive datetime type. For this I expect you would want a datetime and a timedelta, or a pair of datetimes, representing an interval of time. This reminds me a question recently on this list -- someone was using > np.histogram() to bin integer values, and was surprised at the results > -- what they needed to do was consider the bin intervals as floating > point numbers to get what they wanted: 0.5, 1.5, 2.5, rather than > 1,2,3,4, because what they really wanted was an categorical definition > of an integer, NOT a truncated floating point number. I'm not sure how > that informs this conversation, though... > > > > > >>> np.timedelta64(10, 's') + 10 > > > numpy.timedelta64(20,'s') > > > > Here, the unit is defined: 's' > > > > For the first operand, the inconsistency is with the second. Here's > > the reasoning I didn't spell out: > > > We're adding a timedelta + int, so lets convert 10 into a timedelta. > > No units specified, so it's > > 10 microseconds, so we add 10 seconds and 10 microseconds, not 10 > > seconds and 10 seconds. > > This sure seems ripe for error to me -- if a datetime and timedelta are > going to be represented in various possible units, then I don't think it > it's a good idea to allow one to and an integer -- especially if the > unit can be inferred from the input data, rather than specified. > > "Explicit is better than implicit." > > "In the face of ambiguity, refuse the temptation to guess." > > If you must allow this, then using the default for the unspecified unit > as above is the way to go. > There's always a balance between how easy it is to do things at the interactive prompt, and how tightly the system controls what you do. I'm trying to strike a balance, people will have to use it to see if I'm hitting the mark. > Dave Hirschfeld wrote: > >> Here are some current behaviors that are inconsistent with the > microsecond > > default, but consistent with the "generic time unit" idea: > >>>>> np.timedelta64(10, 's') + 10 > >> numpy.timedelta64(20,'s') > > > > That is what I would expect (and hope) would happen. IMO an integer > should be > > cast to the dtype ([s]) of the datetime/timedelta. > > This is way too ripe for error, particularly if we have the unit > auto-determined from input data. > > > > > Not to take us back to a probably already resolved issue, but maybe all > this unit conversion could and should be avoided by following the python > datetime approach -- all datetimes and timedeltas are always defined > with microsecond precision -- period. > > Maybe there are computational efficiencies that we want to avoid. > > This would also preclude any use of these dtypes for work that required > greater precision, but does anyone really need both year, month, day > specification AND nanoseconds? 
Given all the leap-second issues, that > seems a bit ridiculous. > > But it would make things easier. > > I note that in this entire conversation, all the talk has been about > finance examples -- I think I'm the only one that has brought up science > use, and that only barely (and mostly the simple cases). So do we really > need to have the same dtype useful for finance and particle physics? > I'm not sure, but I think it can work out quite well. More domain-specific feedback in other areas always helps. -Mark > > > -Chris > > > > > > > > > > > > > > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 10 19:18:16 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 18:18:16 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <4DEF0A4E.7030509@noaa.gov> Message-ID: On Thu, Jun 9, 2011 at 3:17 PM, Wes McKinney wrote: > On Wed, Jun 8, 2011 at 8:53 PM, Mark Wiebe wrote: > > On Wed, Jun 8, 2011 at 4:57 AM, Wes McKinney > wrote: > >> > >> > >> > >> > >> So in summary, w.r.t. time series data and datetime, the only things I > >> care about from a datetime / pandas point of view: > >> > >> - Ability to easily define custom timedeltas > > > > Can you elaborate on this a bit? I'm guessing you're not referring to the > > timedelta64 in NumPy, which is simply an integer with an associated unit. > > I guess what I am thinking of may not need to be available at the > NumPy level. As an example, suppose you connected a timedelta > (DateOffset, in pandas parlance) to a set of holidays, so when you > say: > > date + BusinessDay(1, calendar='US') > If you have a moment of time, can you comment on the business day api I proposed, and possibly try out the busday_offset function I've begun? It doesn't do holidays yet, but the weekmask seems to work. -Mark > > then you have some custom logic which knows how to describe the result > of that. Some things along these lines have already been discussed-- > but this is the basic way that the date offsets work in pandas, i.e. > subclasses of DateOffset which implement custom logic. Probably too > high level for NumPy. > > >> > >> - Generate datetime objects, or some equivalent, which can be used to > >> back pandas data structures > > > > Do you mean mechanisms to generate sequences of datetime's? I'm fixing up > > arange to work reasonably with datetimes at the moment. > > Yes, that will be very nice. > > >> > >> - (possible now??) Ability to have a set of frequency-naive dates > >> (possibly not in order). > > > > This should work just as well as if you have an arbitrary set of integers > > specifying the locations of the sample points. > > Cheers, > > Mark > > Cool (!). > > >> > >> This last point actually matters. 
Suppose you wanted to get the worst > >> 5-performing days in the S&P 500 index: > >> > >> In [7]: spx.index > >> Out[7]: > >> > >> offset: <1 BusinessDay>, tzinfo: None > >> [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] > >> length: 2963 > >> > >> # but this is OK > >> In [8]: spx.order()[:5] > >> Out[8]: > >> 2008-10-15 00:00:00 -0.0903497960942 > >> 2008-12-01 00:00:00 -0.0892952780505 > >> 2008-09-29 00:00:00 -0.0878970494885 > >> 2008-10-09 00:00:00 -0.0761670761671 > >> 2008-11-20 00:00:00 -0.0671229140321 > >> > >> - W > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Fri Jun 10 19:19:41 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 19:19:41 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: 2011/6/10 Charles R Harris > > > On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: > >> >> >> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>> charlesr.harris at gmail.com> wrote: >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>> >>>>>> Came across an odd error while using numpy master. Note, my system is >>>>>> 32-bits. >>>>>> >>>>>> >>> import numpy as np >>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>> False >>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>> True >>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>> True >>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>> True >>>>>> >>>>>> So, only the summation performed with a np.int32 accumulator results >>>>>> in a type that doesn't match the expected type. Now, for even more >>>>>> strangeness: >>>>>> >>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>> >>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>> '0x9599a0' >>>>>> >>> hex(id(np.int32)) >>>>>> '0x959a80' >>>>>> >>>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>>> memory address is different from the memory address for np.int32. >>>>>> >>>>>> >>>>> One of them is probably a long, print out the typecode, dtype.char. >>>>> >>>>> Chuck >>>>> >>>>> >>>>> >>>> Good intuition, but odd result... >>>> >>>> >>> import numpy as np >>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>> >>> b = np.int32(6) >>>> >>> type(a) >>>> >>>> >>> type(b) >>>> >>>> >>> a.dtype.char >>>> 'i' >>>> >>> b.dtype.char >>>> 'l' >>>> >>>> So, the standard np.int32 is getting listed as a long somehow? To >>>> further investigate: >>>> >>>> >>> Yes, long shifts around from int32 to int64 depending on the OS. For >>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>> On 32 bit systems it is 32 bits. >>> >>> Chuck >>> >>> >> Right, that makes sense. 
But, the question is why does sum() put out a >> result dtype that is not identical to the dtype that I requested, or even >> the dtype of the input array? Could this be an indication of a bug >> somewhere? Even if the bug is harmless (it was only noticed within the test >> suite of larry), is this unexpected? >> >> > I expect sum is using a ufunc and it acts differently on account of the > cleanup of the ufunc casting rules. And yes, a long *is* int32 on your > machine. On mine > > In [4]: dtype('q') # long long > Out[4]: dtype('int64') > > In [5]: dtype('l') # long > Out[5]: dtype('int64') > > The mapping from C types to numpy width types isn't 1-1. Personally, I > think we should drop long ;) But it used to be the standard Python type in > the C API. Mark has also pointed out the problems/confusion this ambiguity > causes and someday we should probably think it out and fix it. But I don't > think it is the most pressing problem. > > Chuck > > But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit computer where both are int32? -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Fri Jun 10 20:02:21 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Sat, 11 Jun 2011 02:02:21 +0200 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> <4DF23234.5030508@gmail.com> Message-ID: On Jun 11, 2011, at 12:57 AM, Mark Wiebe wrote: > The other issue is how do you define the np.arange step argument since that can be in different scales such as month, years, seconds? Can a user specific days and get half-days (like 1.5 days) or must these be 'integer' days? > > The datetime and timedelta are purely integers, so half-days are only possible by going to hours. I think this is good for datetime, but for timedelta it would be nicer to have a general physical quantities/units system in NumPy, so timedelta would be a regular integer or float with time as its metadata instead of having a timedelta type. That's a whole lot more work though, and doesn't seem like a short term thing... Better to leave it to a higher level specific package, I'd say (like more specific stats algorithms are in scipy or some scikits). From pgmdevlist at gmail.com Fri Jun 10 20:04:15 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Sat, 11 Jun 2011 02:04:15 +0200 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <20D9BDDD-1CF4-4B73-BDD5-2D77A7AF4D01@gmail.com> Message-ID: <8C3B05E3-FC83-4D29-B428-8F61BA6C94A7@gmail.com> On Jun 11, 2011, at 1:03 AM, Mark Wiebe wrote: > > I don't think you would want to extend the datetime with more metadata, but rather use it as a tool to create the timeseries with. You could create a lightweight wrapper around datetime arrays which exposed a timeseries-oriented interface but used the datetime functionality to do the computations and provide good performance with large numbers of datetimes. I would love it if you have the time to experiment a bit with the https://github.com/m-paradox/numpy branch I'm developing on, so I could tweak the design or add features to make this easier based on your feedback from using the new data type. I'll do my best, depending on my free time, alas... Did you have a chance to check scikits.timeseries (and its experimental branch) ? 
If you have any question about it, feel free to contact me offlist (less noise that way, we can always summarize the discussion). Cheers P. From charlesr.harris at gmail.com Fri Jun 10 21:35:30 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 10 Jun 2011 19:35:30 -0600 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: > 2011/6/10 Charles R Harris > >> >> >> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >> >>> >>> >>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>> charlesr.harris at gmail.com> wrote: >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>> charlesr.harris at gmail.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>> >>>>>>> Came across an odd error while using numpy master. Note, my system >>>>>>> is 32-bits. >>>>>>> >>>>>>> >>> import numpy as np >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>> False >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>> True >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>> True >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>> True >>>>>>> >>>>>>> So, only the summation performed with a np.int32 accumulator results >>>>>>> in a type that doesn't match the expected type. Now, for even more >>>>>>> strangeness: >>>>>>> >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>> >>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>> '0x9599a0' >>>>>>> >>> hex(id(np.int32)) >>>>>>> '0x959a80' >>>>>>> >>>>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>>>> memory address is different from the memory address for np.int32. >>>>>>> >>>>>>> >>>>>> One of them is probably a long, print out the typecode, dtype.char. >>>>>> >>>>>> Chuck >>>>>> >>>>>> >>>>>> >>>>> Good intuition, but odd result... >>>>> >>>>> >>> import numpy as np >>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>> >>> b = np.int32(6) >>>>> >>> type(a) >>>>> >>>>> >>> type(b) >>>>> >>>>> >>> a.dtype.char >>>>> 'i' >>>>> >>> b.dtype.char >>>>> 'l' >>>>> >>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>> further investigate: >>>>> >>>>> >>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>> On 32 bit systems it is 32 bits. >>>> >>>> Chuck >>>> >>>> >>> Right, that makes sense. But, the question is why does sum() put out a >>> result dtype that is not identical to the dtype that I requested, or even >>> the dtype of the input array? Could this be an indication of a bug >>> somewhere? Even if the bug is harmless (it was only noticed within the test >>> suite of larry), is this unexpected? >>> >>> >> I expect sum is using a ufunc and it acts differently on account of the >> cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >> machine. On mine >> >> In [4]: dtype('q') # long long >> Out[4]: dtype('int64') >> >> In [5]: dtype('l') # long >> Out[5]: dtype('int64') >> >> The mapping from C types to numpy width types isn't 1-1. Personally, I >> think we should drop long ;) But it used to be the standard Python type in >> the C API. 
Mark has also pointed out the problems/confusion this ambiguity >> causes and someday we should probably think it out and fix it. But I don't >> think it is the most pressing problem. >> >> Chuck >> >> > But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit > computer where both are int32? > > Maybe yes, maybe no ;) They have different descriptors, so from numpy's perspective they are different, but at the hardware/precision level they are the same. It's more of a decision as to what != means in this case. Since numpy started as Numeric with only the c types the current behavior is consistent, but that doesn't mean it shouldn't change at some point. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Fri Jun 10 21:50:30 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 21:50:30 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: 2011/6/10 Charles R Harris > > > On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: > >> 2011/6/10 Charles R Harris >> >>> >>> >>> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>>> charlesr.harris at gmail.com> wrote: >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>>> charlesr.harris at gmail.com> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>>> >>>>>>>> Came across an odd error while using numpy master. Note, my system >>>>>>>> is 32-bits. >>>>>>>> >>>>>>>> >>> import numpy as np >>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>>> False >>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>>> True >>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>>> True >>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>>> True >>>>>>>> >>>>>>>> So, only the summation performed with a np.int32 accumulator results >>>>>>>> in a type that doesn't match the expected type. Now, for even more >>>>>>>> strangeness: >>>>>>>> >>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>>> >>>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>>> '0x9599a0' >>>>>>>> >>> hex(id(np.int32)) >>>>>>>> '0x959a80' >>>>>>>> >>>>>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>>>>> memory address is different from the memory address for np.int32. >>>>>>>> >>>>>>>> >>>>>>> One of them is probably a long, print out the typecode, dtype.char. >>>>>>> >>>>>>> Chuck >>>>>>> >>>>>>> >>>>>>> >>>>>> Good intuition, but odd result... >>>>>> >>>>>> >>> import numpy as np >>>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>>> >>> b = np.int32(6) >>>>>> >>> type(a) >>>>>> >>>>>> >>> type(b) >>>>>> >>>>>> >>> a.dtype.char >>>>>> 'i' >>>>>> >>> b.dtype.char >>>>>> 'l' >>>>>> >>>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>>> further investigate: >>>>>> >>>>>> >>>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>>> On 32 bit systems it is 32 bits. >>>>> >>>>> Chuck >>>>> >>>>> >>>> Right, that makes sense. But, the question is why does sum() put out a >>>> result dtype that is not identical to the dtype that I requested, or even >>>> the dtype of the input array? 
Could this be an indication of a bug >>>> somewhere? Even if the bug is harmless (it was only noticed within the test >>>> suite of larry), is this unexpected? >>>> >>>> >>> I expect sum is using a ufunc and it acts differently on account of the >>> cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >>> machine. On mine >>> >>> In [4]: dtype('q') # long long >>> Out[4]: dtype('int64') >>> >>> In [5]: dtype('l') # long >>> Out[5]: dtype('int64') >>> >>> The mapping from C types to numpy width types isn't 1-1. Personally, I >>> think we should drop long ;) But it used to be the standard Python type in >>> the C API. Mark has also pointed out the problems/confusion this ambiguity >>> causes and someday we should probably think it out and fix it. But I don't >>> think it is the most pressing problem. >>> >>> Chuck >>> >>> >> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >> computer where both are int32? >> >> > Maybe yes, maybe no ;) They have different descriptors, so from numpy's > perspective they are different, but at the hardware/precision level they are > the same. It's more of a decision as to what != means in this case. Since > numpy started as Numeric with only the c types the current behavior is > consistent, but that doesn't mean it shouldn't change at some point. > > Chuck > Well apparently it was actually changed recently, since in Numpy 1.5.1 on a Windows 32 bit machine, they are considered equal with '=='. Personally I think if the string representation of two dtypes is "int32", then they should be ==, otherwise it wouldn't make much sense given that you can directly test the equality of a dtype with a string like "int32" (like dtype('i') == "int32" and dtype('l') == "int32"). -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwgoodman at gmail.com Fri Jun 10 21:51:14 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Fri, 10 Jun 2011 18:51:14 -0700 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 6:35 PM, Charles R Harris wrote: > On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: >> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >> computer where both are int32? >> > > Maybe yes, maybe no ;) They have different descriptors, so from numpy's > perspective they are different, but at the hardware/precision level they are > the same. It's more of a decision as to what? != means in this case. Since > numpy started as Numeric with only the c types the current behavior is > consistent, but that doesn't mean it shouldn't change at some point. Maybe this is the same question, but are you maybe yes, maybe no on this too: >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 False Ben, what happens if you put an axis in there? Like >>> np.sum([[1, 2, 3], [4,5,6]], axis=0).dtype == np.int32 Just wondering if this is another different-dtype-for-different-axis case. From pav at iki.fi Fri Jun 10 22:00:24 2011 From: pav at iki.fi (Pauli Virtanen) Date: Sat, 11 Jun 2011 02:00:24 +0000 (UTC) Subject: [Numpy-discussion] numpy type mismatch References: Message-ID: On Fri, 10 Jun 2011 19:35:30 -0600, Charles R Harris wrote: [clip] > Maybe yes, maybe no ;) They have different descriptors, so from numpy's > perspective they are different, but at the hardware/precision level they > are the same. It's more of a decision as to what != means in this case. 
> Since numpy started as Numeric with only the c types the current > behavior is consistent, but that doesn't mean it shouldn't change at > some point. IMO, Numpy should not have unnecessary scalar Python types in play. Having dtypes with different equivalent chars is not so bad, since they compare equal -- the scalar types don't. This could possibly be resolved by adjusting multiarraymodule so that it registers only one scalar type per size, according to what it sees at compile time. One issue is that the scalar types are global structs, and not pointers, so it's not so easy to make one point to another. Maybe they can be memcpy'd to make aliases, although that sounds dirty... -- Pauli Virtanen From pav at iki.fi Fri Jun 10 22:03:53 2011 From: pav at iki.fi (Pauli Virtanen) Date: Sat, 11 Jun 2011 02:03:53 +0000 (UTC) Subject: [Numpy-discussion] numpy type mismatch References: Message-ID: On Fri, 10 Jun 2011 18:51:14 -0700, Keith Goodman wrote: [clip] > Maybe this is the same question, but are you maybe yes, maybe no on this > too: > > >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 > False Note that this is a comparison between two Python types... > Ben, what happens if you put an axis in there? Like > > >>> np.sum([[1, 2, 3], [4,5,6]], axis=0).dtype == np.int32 > > Just wondering if this is another different-dtype-for-different-axis > case. ... and this between Numpy dtypes (the RHS operand is cast to a dtype). The confusion comes from the fact that Numpy registers two Python types, one of them corresponding to 'l' and one to 'i', but both print as "int32" so it's difficult to tell which one you have. They behave exactly the same way everywhere, except that they do not compare equal... From shish at keba.be Fri Jun 10 22:29:53 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 22:29:53 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: 2011/6/10 Olivier Delalleau > 2011/6/10 Charles R Harris > >> >> >> On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: >> >>> 2011/6/10 Charles R Harris >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>>>> charlesr.harris at gmail.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>>>> >>>>>>>>> Came across an odd error while using numpy master. Note, my system >>>>>>>>> is 32-bits. >>>>>>>>> >>>>>>>>> >>> import numpy as np >>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>>>> False >>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>>>> True >>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>>>> True >>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>>>> True >>>>>>>>> >>>>>>>>> So, only the summation performed with a np.int32 accumulator >>>>>>>>> results in a type that doesn't match the expected type. 
Now, for even more >>>>>>>>> strangeness: >>>>>>>>> >>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>>>> >>>>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>>>> '0x9599a0' >>>>>>>>> >>> hex(id(np.int32)) >>>>>>>>> '0x959a80' >>>>>>>>> >>>>>>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>>>>>> memory address is different from the memory address for np.int32. >>>>>>>>> >>>>>>>>> >>>>>>>> One of them is probably a long, print out the typecode, dtype.char. >>>>>>>> >>>>>>>> Chuck >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Good intuition, but odd result... >>>>>>> >>>>>>> >>> import numpy as np >>>>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>>>> >>> b = np.int32(6) >>>>>>> >>> type(a) >>>>>>> >>>>>>> >>> type(b) >>>>>>> >>>>>>> >>> a.dtype.char >>>>>>> 'i' >>>>>>> >>> b.dtype.char >>>>>>> 'l' >>>>>>> >>>>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>>>> further investigate: >>>>>>> >>>>>>> >>>>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>>>> On 32 bit systems it is 32 bits. >>>>>> >>>>>> Chuck >>>>>> >>>>>> >>>>> Right, that makes sense. But, the question is why does sum() put out a >>>>> result dtype that is not identical to the dtype that I requested, or even >>>>> the dtype of the input array? Could this be an indication of a bug >>>>> somewhere? Even if the bug is harmless (it was only noticed within the test >>>>> suite of larry), is this unexpected? >>>>> >>>>> >>>> I expect sum is using a ufunc and it acts differently on account of the >>>> cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >>>> machine. On mine >>>> >>>> In [4]: dtype('q') # long long >>>> Out[4]: dtype('int64') >>>> >>>> In [5]: dtype('l') # long >>>> Out[5]: dtype('int64') >>>> >>>> The mapping from C types to numpy width types isn't 1-1. Personally, I >>>> think we should drop long ;) But it used to be the standard Python type in >>>> the C API. Mark has also pointed out the problems/confusion this ambiguity >>>> causes and someday we should probably think it out and fix it. But I don't >>>> think it is the most pressing problem. >>>> >>>> Chuck >>>> >>>> >>> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >>> computer where both are int32? >>> >>> >> Maybe yes, maybe no ;) They have different descriptors, so from numpy's >> perspective they are different, but at the hardware/precision level they are >> the same. It's more of a decision as to what != means in this case. Since >> numpy started as Numeric with only the c types the current behavior is >> consistent, but that doesn't mean it shouldn't change at some point. >> >> Chuck >> > > Well apparently it was actually changed recently, since in Numpy 1.5.1 on a > Windows 32 bit machine, they are considered equal with '=='. > Personally I think if the string representation of two dtypes is "int32", > then they should be ==, otherwise it wouldn't make much sense given that you > can directly test the equality of a dtype with a string like "int32" (like > dtype('i') == "int32" and dtype('l') == "int32"). > I also just checked on a fresh install of numpy 1.6.0 on python 3.2, and both types are equal as well. I've been playing quite a bit with numpy dtypes and it's the first time I hear two dtypes representing the exact same kind of data do not compare equal, so I'm still enclined to believe it should be considered a bug. 
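In case it helps to compare notes across platforms, the little diagnostic I've been running is simply this (nothing more than a sanity check -- the itemsize and equality results will of course differ between machines):

import numpy as np

for code in 'ilq':  # C int, C long, C long long
    d = np.dtype(code)
    print("%s: str=%s, itemsize=%d, num=%d, == int32: %s"
          % (code, d, d.itemsize, d.num, d == np.dtype(np.int32)))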
-=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 10 22:55:40 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 21:55:40 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 8:51 PM, Keith Goodman wrote: > On Fri, Jun 10, 2011 at 6:35 PM, Charles R Harris > wrote: > > On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau > wrote: > > >> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit > >> computer where both are int32? > >> > > > > Maybe yes, maybe no ;) They have different descriptors, so from numpy's > > perspective they are different, but at the hardware/precision level they > are > > the same. It's more of a decision as to what != means in this case. > Since > > numpy started as Numeric with only the c types the current behavior is > > consistent, but that doesn't mean it shouldn't change at some point. > > Maybe this is the same question, but are you maybe yes, maybe no on this > too: > > >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 > False > > Ben, what happens if you put an axis in there? Like > > >>> np.sum([[1, 2, 3], [4,5,6]], axis=0).dtype == np.int32 > The same thing happens as before. > > Just wondering if this is another different-dtype-for-different-axis case. > No, I think Chuck has it right and that this is the result of the recent cleanup work for ufunc casting rules. However, I am so entirely unfamiliar with ufuncs that I really don't know how to investigate this. Can we get Mark Wiebe's opinion on this? I suspect he might recognize the problem right away. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 10 22:59:11 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 21:59:11 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 9:03 PM, Pauli Virtanen wrote: > On Fri, 10 Jun 2011 18:51:14 -0700, Keith Goodman wrote: > [clip] > > Maybe this is the same question, but are you maybe yes, maybe no on this > > too: > > > > >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 > > False > > Note that this is a comparison between two Python types... > > > Ben, what happens if you put an axis in there? Like > > > > >>> np.sum([[1, 2, 3], [4,5,6]], axis=0).dtype == np.int32 > > > > Just wondering if this is another different-dtype-for-different-axis > > case. > > ... and this between Numpy dtypes (the RHS operand is cast to a dtype). > > And just to note for my previous response, I did compare the python types, not the dtypes. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ben.root at ou.edu Fri Jun 10 23:10:51 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 22:10:51 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 9:29 PM, Olivier Delalleau wrote: > > 2011/6/10 Olivier Delalleau > >> 2011/6/10 Charles R Harris >> >>> >>> >>> On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: >>> >>>> 2011/6/10 Charles R Harris >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>>>>> charlesr.harris at gmail.com> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>>>>> >>>>>>>>>> Came across an odd error while using numpy master. Note, my >>>>>>>>>> system is 32-bits. >>>>>>>>>> >>>>>>>>>> >>> import numpy as np >>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>>>>> False >>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>>>>> True >>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>>>>> True >>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>>>>> True >>>>>>>>>> >>>>>>>>>> So, only the summation performed with a np.int32 accumulator >>>>>>>>>> results in a type that doesn't match the expected type. Now, for even more >>>>>>>>>> strangeness: >>>>>>>>>> >>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>>>>> >>>>>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>>>>> '0x9599a0' >>>>>>>>>> >>> hex(id(np.int32)) >>>>>>>>>> '0x959a80' >>>>>>>>>> >>>>>>>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>>>>>>> memory address is different from the memory address for np.int32. >>>>>>>>>> >>>>>>>>>> >>>>>>>>> One of them is probably a long, print out the typecode, dtype.char. >>>>>>>>> >>>>>>>>> Chuck >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> Good intuition, but odd result... >>>>>>>> >>>>>>>> >>> import numpy as np >>>>>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>>>>> >>> b = np.int32(6) >>>>>>>> >>> type(a) >>>>>>>> >>>>>>>> >>> type(b) >>>>>>>> >>>>>>>> >>> a.dtype.char >>>>>>>> 'i' >>>>>>>> >>> b.dtype.char >>>>>>>> 'l' >>>>>>>> >>>>>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>>>>> further investigate: >>>>>>>> >>>>>>>> >>>>>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>>>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>>>>> On 32 bit systems it is 32 bits. >>>>>>> >>>>>>> Chuck >>>>>>> >>>>>>> >>>>>> Right, that makes sense. But, the question is why does sum() put out >>>>>> a result dtype that is not identical to the dtype that I requested, or even >>>>>> the dtype of the input array? Could this be an indication of a bug >>>>>> somewhere? Even if the bug is harmless (it was only noticed within the test >>>>>> suite of larry), is this unexpected? >>>>>> >>>>>> >>>>> I expect sum is using a ufunc and it acts differently on account of the >>>>> cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >>>>> machine. 
On mine >>>>> >>>>> In [4]: dtype('q') # long long >>>>> Out[4]: dtype('int64') >>>>> >>>>> In [5]: dtype('l') # long >>>>> Out[5]: dtype('int64') >>>>> >>>>> The mapping from C types to numpy width types isn't 1-1. Personally, I >>>>> think we should drop long ;) But it used to be the standard Python type in >>>>> the C API. Mark has also pointed out the problems/confusion this ambiguity >>>>> causes and someday we should probably think it out and fix it. But I don't >>>>> think it is the most pressing problem. >>>>> >>>>> Chuck >>>>> >>>>> >>>> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >>>> computer where both are int32? >>>> >>>> >>> Maybe yes, maybe no ;) They have different descriptors, so from numpy's >>> perspective they are different, but at the hardware/precision level they are >>> the same. It's more of a decision as to what != means in this case. Since >>> numpy started as Numeric with only the c types the current behavior is >>> consistent, but that doesn't mean it shouldn't change at some point. >>> >>> Chuck >>> >> >> Well apparently it was actually changed recently, since in Numpy 1.5.1 on >> a Windows 32 bit machine, they are considered equal with '=='. >> Personally I think if the string representation of two dtypes is "int32", >> then they should be ==, otherwise it wouldn't make much sense given that you >> can directly test the equality of a dtype with a string like "int32" (like >> dtype('i') == "int32" and dtype('l') == "int32"). >> > > I also just checked on a fresh install of numpy 1.6.0 on python 3.2, and > both types are equal as well. > Are you talking about the release of 1.6, or the continued development branch? This is happening to me on the master branch, but I have not tried earlier versions. Again, I think this bolsters the evidence that this is from a (very) recent change. > I've been playing quite a bit with numpy dtypes and it's the first time I > hear two dtypes representing the exact same kind of data do not compare > equal, so I'm still enclined to believe it should be considered a bug. > > Quite honestly, I really don't care that the dtypes aren't equal. I usually work at a purely python level and performing actions based on types is generally bad practice anyway. Anytime that I (rarely) check types, I would use isinstance() against one of the core numerical types rather than a numpy type. The fact that I even found this issue was completely by accident while investigating a test failure in larry. What concerns me more is that the type coming from the ufunc is not the same type that went in, or even requested through the dtype argument. I think *that* should be the main concern here, and should probably be tested for in the unit tests. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shish at keba.be Fri Jun 10 23:34:05 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 10 Jun 2011 23:34:05 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: 2011/6/10 Benjamin Root > > > On Fri, Jun 10, 2011 at 9:29 PM, Olivier Delalleau wrote: > >> >> 2011/6/10 Olivier Delalleau >> >>> 2011/6/10 Charles R Harris >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: >>>> >>>>> 2011/6/10 Charles R Harris >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>>>>>> >>>>>>>>>>> Came across an odd error while using numpy master. Note, my >>>>>>>>>>> system is 32-bits. >>>>>>>>>>> >>>>>>>>>>> >>> import numpy as np >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>>>>>> False >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>>>>>> True >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>>>>>> True >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>>>>>> True >>>>>>>>>>> >>>>>>>>>>> So, only the summation performed with a np.int32 accumulator >>>>>>>>>>> results in a type that doesn't match the expected type. Now, for even more >>>>>>>>>>> strangeness: >>>>>>>>>>> >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>>>>>> >>>>>>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>>>>>> '0x9599a0' >>>>>>>>>>> >>> hex(id(np.int32)) >>>>>>>>>>> '0x959a80' >>>>>>>>>>> >>>>>>>>>>> So, the type from the sum() reports itself as a numpy int, but >>>>>>>>>>> its memory address is different from the memory address for np.int32. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> One of them is probably a long, print out the typecode, >>>>>>>>>> dtype.char. >>>>>>>>>> >>>>>>>>>> Chuck >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> Good intuition, but odd result... >>>>>>>>> >>>>>>>>> >>> import numpy as np >>>>>>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>>>>>> >>> b = np.int32(6) >>>>>>>>> >>> type(a) >>>>>>>>> >>>>>>>>> >>> type(b) >>>>>>>>> >>>>>>>>> >>> a.dtype.char >>>>>>>>> 'i' >>>>>>>>> >>> b.dtype.char >>>>>>>>> 'l' >>>>>>>>> >>>>>>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>>>>>> further investigate: >>>>>>>>> >>>>>>>>> >>>>>>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>>>>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>>>>>> On 32 bit systems it is 32 bits. >>>>>>>> >>>>>>>> Chuck >>>>>>>> >>>>>>>> >>>>>>> Right, that makes sense. But, the question is why does sum() put out >>>>>>> a result dtype that is not identical to the dtype that I requested, or even >>>>>>> the dtype of the input array? Could this be an indication of a bug >>>>>>> somewhere? Even if the bug is harmless (it was only noticed within the test >>>>>>> suite of larry), is this unexpected? >>>>>>> >>>>>>> >>>>>> I expect sum is using a ufunc and it acts differently on account of >>>>>> the cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >>>>>> machine. 
On mine >>>>>> >>>>>> In [4]: dtype('q') # long long >>>>>> Out[4]: dtype('int64') >>>>>> >>>>>> In [5]: dtype('l') # long >>>>>> Out[5]: dtype('int64') >>>>>> >>>>>> The mapping from C types to numpy width types isn't 1-1. Personally, I >>>>>> think we should drop long ;) But it used to be the standard Python type in >>>>>> the C API. Mark has also pointed out the problems/confusion this ambiguity >>>>>> causes and someday we should probably think it out and fix it. But I don't >>>>>> think it is the most pressing problem. >>>>>> >>>>>> Chuck >>>>>> >>>>>> >>>>> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >>>>> computer where both are int32? >>>>> >>>>> >>>> Maybe yes, maybe no ;) They have different descriptors, so from numpy's >>>> perspective they are different, but at the hardware/precision level they are >>>> the same. It's more of a decision as to what != means in this case. Since >>>> numpy started as Numeric with only the c types the current behavior is >>>> consistent, but that doesn't mean it shouldn't change at some point. >>>> >>>> Chuck >>>> >>> >>> Well apparently it was actually changed recently, since in Numpy 1.5.1 on >>> a Windows 32 bit machine, they are considered equal with '=='. >>> Personally I think if the string representation of two dtypes is "int32", >>> then they should be ==, otherwise it wouldn't make much sense given that you >>> can directly test the equality of a dtype with a string like "int32" (like >>> dtype('i') == "int32" and dtype('l') == "int32"). >>> >> >> I also just checked on a fresh install of numpy 1.6.0 on python 3.2, and >> both types are equal as well. >> > > Are you talking about the release of 1.6, or the continued development > branch? This is happening to me on the master branch, but I have not tried > earlier versions. Again, I think this bolsters the evidence that this is > from a (very) recent change. > > >> I've been playing quite a bit with numpy dtypes and it's the first time I >> hear two dtypes representing the exact same kind of data do not compare >> equal, so I'm still enclined to believe it should be considered a bug. >> >> > Quite honestly, I really don't care that the dtypes aren't equal. I > usually work at a purely python level and performing actions based on types > is generally bad practice anyway. Anytime that I (rarely) check types, I > would use isinstance() against one of the core numerical types rather than a > numpy type. The fact that I even found this issue was completely by > accident while investigating a test failure in larry. > > What concerns me more is that the type coming from the ufunc is not the > same type that went in, or even requested through the dtype argument. I > think *that* should be the main concern here, and should probably be tested > for in the unit tests. > > Ben Root > The project I'm working on (http://deeplearning.net/software/theano/) heavily relies on dtype.__eq__, because it uses typed objects associated to data of e.g. int32 or float64 types, and it needs to know if the provided numpy arrays are of the proper type. So we do a lot of comparisons like: array.dtype == "int32" I'd be curious to know, in your case, what is the output of the following lines: numpy.dtype('i') == "int32" numpy.dtype('l') == "int32" str(numpy.dtype('i')) str(numpy.dtype('l')) -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Fri Jun 10 23:56:04 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 10 Jun 2011 21:56:04 -0600 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 9:10 PM, Benjamin Root wrote: > > > On Fri, Jun 10, 2011 at 9:29 PM, Olivier Delalleau wrote: > >> >> 2011/6/10 Olivier Delalleau >> >>> 2011/6/10 Charles R Harris >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: >>>> >>>>> 2011/6/10 Charles R Harris >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>>>>>> >>>>>>>>>>> Came across an odd error while using numpy master. Note, my >>>>>>>>>>> system is 32-bits. >>>>>>>>>>> >>>>>>>>>>> >>> import numpy as np >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>>>>>> False >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>>>>>> True >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>>>>>> True >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>>>>>> True >>>>>>>>>>> >>>>>>>>>>> So, only the summation performed with a np.int32 accumulator >>>>>>>>>>> results in a type that doesn't match the expected type. Now, for even more >>>>>>>>>>> strangeness: >>>>>>>>>>> >>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>>>>>> >>>>>>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>>>>>> '0x9599a0' >>>>>>>>>>> >>> hex(id(np.int32)) >>>>>>>>>>> '0x959a80' >>>>>>>>>>> >>>>>>>>>>> So, the type from the sum() reports itself as a numpy int, but >>>>>>>>>>> its memory address is different from the memory address for np.int32. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> One of them is probably a long, print out the typecode, >>>>>>>>>> dtype.char. >>>>>>>>>> >>>>>>>>>> Chuck >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> Good intuition, but odd result... >>>>>>>>> >>>>>>>>> >>> import numpy as np >>>>>>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>>>>>> >>> b = np.int32(6) >>>>>>>>> >>> type(a) >>>>>>>>> >>>>>>>>> >>> type(b) >>>>>>>>> >>>>>>>>> >>> a.dtype.char >>>>>>>>> 'i' >>>>>>>>> >>> b.dtype.char >>>>>>>>> 'l' >>>>>>>>> >>>>>>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>>>>>> further investigate: >>>>>>>>> >>>>>>>>> >>>>>>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>>>>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>>>>>> On 32 bit systems it is 32 bits. >>>>>>>> >>>>>>>> Chuck >>>>>>>> >>>>>>>> >>>>>>> Right, that makes sense. But, the question is why does sum() put out >>>>>>> a result dtype that is not identical to the dtype that I requested, or even >>>>>>> the dtype of the input array? Could this be an indication of a bug >>>>>>> somewhere? Even if the bug is harmless (it was only noticed within the test >>>>>>> suite of larry), is this unexpected? 
>>>>>>> >>>>>>> >>>>>> I expect sum is using a ufunc and it acts differently on account of >>>>>> the cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >>>>>> machine. On mine >>>>>> >>>>>> In [4]: dtype('q') # long long >>>>>> Out[4]: dtype('int64') >>>>>> >>>>>> In [5]: dtype('l') # long >>>>>> Out[5]: dtype('int64') >>>>>> >>>>>> The mapping from C types to numpy width types isn't 1-1. Personally, I >>>>>> think we should drop long ;) But it used to be the standard Python type in >>>>>> the C API. Mark has also pointed out the problems/confusion this ambiguity >>>>>> causes and someday we should probably think it out and fix it. But I don't >>>>>> think it is the most pressing problem. >>>>>> >>>>>> Chuck >>>>>> >>>>>> >>>>> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >>>>> computer where both are int32? >>>>> >>>>> >>>> Maybe yes, maybe no ;) They have different descriptors, so from numpy's >>>> perspective they are different, but at the hardware/precision level they are >>>> the same. It's more of a decision as to what != means in this case. Since >>>> numpy started as Numeric with only the c types the current behavior is >>>> consistent, but that doesn't mean it shouldn't change at some point. >>>> >>>> Chuck >>>> >>> >>> Well apparently it was actually changed recently, since in Numpy 1.5.1 on >>> a Windows 32 bit machine, they are considered equal with '=='. >>> Personally I think if the string representation of two dtypes is "int32", >>> then they should be ==, otherwise it wouldn't make much sense given that you >>> can directly test the equality of a dtype with a string like "int32" (like >>> dtype('i') == "int32" and dtype('l') == "int32"). >>> >> >> I also just checked on a fresh install of numpy 1.6.0 on python 3.2, and >> both types are equal as well. >> > > Are you talking about the release of 1.6, or the continued development > branch? This is happening to me on the master branch, but I have not tried > earlier versions. Again, I think this bolsters the evidence that this is > from a (very) recent change. > > >> I've been playing quite a bit with numpy dtypes and it's the first time I >> hear two dtypes representing the exact same kind of data do not compare >> equal, so I'm still enclined to believe it should be considered a bug. >> >> > Quite honestly, I really don't care that the dtypes aren't equal. I > usually work at a purely python level and performing actions based on types > is generally bad practice anyway. Anytime that I (rarely) check types, I > would use isinstance() against one of the core numerical types rather than a > numpy type. The fact that I even found this issue was completely by > accident while investigating a test failure in larry. > > What concerns me more is that the type coming from the ufunc is not the > same type that went in, or even requested through the dtype argument. I > think *that* should be the main concern here, and should probably be tested > for in the unit tests. > > To be a bit more explicit: In [3]: np.sum([1, 2, 3], dtype='q').dtype.char Out[3]: 'l' In [4]: np.sum([1, 2, 3], dtype='l').dtype.char Out[4]: 'l' Note that there were previous oddities, for instance the returned type for a + b would not necessarily be the same as for b + a, even though the precisions would be the same. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
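A quick way to check whether that commutativity quirk shows up on a given build is something like the following (purely a diagnostic; no outputs shown here since they depend on the numpy version and platform):

import numpy as np

a = np.ones(3, dtype='i')  # C int
b = np.ones(3, dtype='l')  # C long
# On some builds the result dtype of a + b and b + a could differ;
# printing the typecodes makes that easy to spot.
print("%s %s" % ((a + b).dtype.char, (b + a).dtype.char))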
URL: From ben.root at ou.edu Fri Jun 10 23:57:37 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 10 Jun 2011 22:57:37 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 10:34 PM, Olivier Delalleau wrote: > 2011/6/10 Benjamin Root > >> >> >> On Fri, Jun 10, 2011 at 9:29 PM, Olivier Delalleau wrote: >> >>> >>> 2011/6/10 Olivier Delalleau >>> >>>> 2011/6/10 Charles R Harris >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau wrote: >>>>> >>>>>> 2011/6/10 Charles R Harris >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>>>>>>> charlesr.harris at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>>>>>>> >>>>>>>>>>>> Came across an odd error while using numpy master. Note, my >>>>>>>>>>>> system is 32-bits. >>>>>>>>>>>> >>>>>>>>>>>> >>> import numpy as np >>>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>>>>>>> False >>>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>>>>>>> True >>>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>>>>>>> True >>>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>>>>>>> True >>>>>>>>>>>> >>>>>>>>>>>> So, only the summation performed with a np.int32 accumulator >>>>>>>>>>>> results in a type that doesn't match the expected type. Now, for even more >>>>>>>>>>>> strangeness: >>>>>>>>>>>> >>>>>>>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>>>>>>> >>>>>>>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>>>>>>> '0x9599a0' >>>>>>>>>>>> >>> hex(id(np.int32)) >>>>>>>>>>>> '0x959a80' >>>>>>>>>>>> >>>>>>>>>>>> So, the type from the sum() reports itself as a numpy int, but >>>>>>>>>>>> its memory address is different from the memory address for np.int32. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> One of them is probably a long, print out the typecode, >>>>>>>>>>> dtype.char. >>>>>>>>>>> >>>>>>>>>>> Chuck >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> Good intuition, but odd result... >>>>>>>>>> >>>>>>>>>> >>> import numpy as np >>>>>>>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>>>>>>> >>> b = np.int32(6) >>>>>>>>>> >>> type(a) >>>>>>>>>> >>>>>>>>>> >>> type(b) >>>>>>>>>> >>>>>>>>>> >>> a.dtype.char >>>>>>>>>> 'i' >>>>>>>>>> >>> b.dtype.char >>>>>>>>>> 'l' >>>>>>>>>> >>>>>>>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>>>>>>> further investigate: >>>>>>>>>> >>>>>>>>>> >>>>>>>>> Yes, long shifts around from int32 to int64 depending on the OS. >>>>>>>>> For instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 >>>>>>>>> bits. On 32 bit systems it is 32 bits. >>>>>>>>> >>>>>>>>> Chuck >>>>>>>>> >>>>>>>>> >>>>>>>> Right, that makes sense. But, the question is why does sum() put >>>>>>>> out a result dtype that is not identical to the dtype that I requested, or >>>>>>>> even the dtype of the input array? Could this be an indication of a bug >>>>>>>> somewhere? Even if the bug is harmless (it was only noticed within the test >>>>>>>> suite of larry), is this unexpected? 
>>>>>>>> >>>>>>>> >>>>>>> I expect sum is using a ufunc and it acts differently on account of >>>>>>> the cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >>>>>>> machine. On mine >>>>>>> >>>>>>> In [4]: dtype('q') # long long >>>>>>> Out[4]: dtype('int64') >>>>>>> >>>>>>> In [5]: dtype('l') # long >>>>>>> Out[5]: dtype('int64') >>>>>>> >>>>>>> The mapping from C types to numpy width types isn't 1-1. Personally, >>>>>>> I think we should drop long ;) But it used to be the standard Python type in >>>>>>> the C API. Mark has also pointed out the problems/confusion this ambiguity >>>>>>> causes and someday we should probably think it out and fix it. But I don't >>>>>>> think it is the most pressing problem. >>>>>>> >>>>>>> Chuck >>>>>>> >>>>>>> >>>>>> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >>>>>> computer where both are int32? >>>>>> >>>>>> >>>>> Maybe yes, maybe no ;) They have different descriptors, so from numpy's >>>>> perspective they are different, but at the hardware/precision level they are >>>>> the same. It's more of a decision as to what != means in this case. Since >>>>> numpy started as Numeric with only the c types the current behavior is >>>>> consistent, but that doesn't mean it shouldn't change at some point. >>>>> >>>>> Chuck >>>>> >>>> >>>> Well apparently it was actually changed recently, since in Numpy 1.5.1 >>>> on a Windows 32 bit machine, they are considered equal with '=='. >>>> Personally I think if the string representation of two dtypes is >>>> "int32", then they should be ==, otherwise it wouldn't make much sense given >>>> that you can directly test the equality of a dtype with a string like >>>> "int32" (like dtype('i') == "int32" and dtype('l') == "int32"). >>>> >>> >>> I also just checked on a fresh install of numpy 1.6.0 on python 3.2, and >>> both types are equal as well. >>> >> >> Are you talking about the release of 1.6, or the continued development >> branch? This is happening to me on the master branch, but I have not tried >> earlier versions. Again, I think this bolsters the evidence that this is >> from a (very) recent change. >> >> >>> I've been playing quite a bit with numpy dtypes and it's the first time I >>> hear two dtypes representing the exact same kind of data do not compare >>> equal, so I'm still enclined to believe it should be considered a bug. >>> >>> >> Quite honestly, I really don't care that the dtypes aren't equal. I >> usually work at a purely python level and performing actions based on types >> is generally bad practice anyway. Anytime that I (rarely) check types, I >> would use isinstance() against one of the core numerical types rather than a >> numpy type. The fact that I even found this issue was completely by >> accident while investigating a test failure in larry. >> >> What concerns me more is that the type coming from the ufunc is not the >> same type that went in, or even requested through the dtype argument. I >> think *that* should be the main concern here, and should probably be tested >> for in the unit tests. >> >> Ben Root >> > > The project I'm working on (http://deeplearning.net/software/theano/) > heavily relies on dtype.__eq__, because it uses typed objects associated to > data of e.g. int32 or float64 types, and it needs to know if the provided > numpy arrays are of the proper type. 
> So we do a lot of comparisons like: > array.dtype == "int32" > > I'd be curious to know, in your case, what is the output of the following > lines: > numpy.dtype('i') == "int32" > numpy.dtype('l') == "int32" > str(numpy.dtype('i')) > str(numpy.dtype('l')) > > My output on a 32-bit, Ubuntu 11.04 machine with the latest numpy from master is: True True 'int32' 'int32' I will have a new 64-bit machine up and running on Monday (yay!) to do some further tests on, but I suspect I am currently in the minority here for architecture type. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 11 00:00:55 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 10 Jun 2011 23:00:55 -0500 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: On Fri, Jun 10, 2011 at 9:55 PM, Benjamin Root wrote: > > > On Fri, Jun 10, 2011 at 8:51 PM, Keith Goodman wrote: > >> On Fri, Jun 10, 2011 at 6:35 PM, Charles R Harris >> wrote: >> > On Fri, Jun 10, 2011 at 5:19 PM, Olivier Delalleau >> wrote: >> >> >> But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit >> >> computer where both are int32? >> >> >> > >> > Maybe yes, maybe no ;) They have different descriptors, so from numpy's >> > perspective they are different, but at the hardware/precision level they >> are >> > the same. It's more of a decision as to what != means in this case. >> Since >> > numpy started as Numeric with only the c types the current behavior is >> > consistent, but that doesn't mean it shouldn't change at some point. >> >> Maybe this is the same question, but are you maybe yes, maybe no on this >> too: >> >> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >> False >> >> Ben, what happens if you put an axis in there? Like >> >> >>> np.sum([[1, 2, 3], [4,5,6]], axis=0).dtype == np.int32 >> > > The same thing happens as before. > > >> >> Just wondering if this is another different-dtype-for-different-axis case. >> > > No, I think Chuck has it right and that this is the result of the recent > cleanup work for ufunc casting rules. However, I am so entirely unfamiliar > with ufuncs that I really don't know how to investigate this. Can we get > Mark Wiebe's opinion on this? I suspect he might recognize the problem > right away. > Yeah, it's basically a misfeature of NumPy which is apparently inherited from Numeric. The type resolution changes to 1.6 changed where this would pop up, but it's always been there in the code. I suspect that eliminating the long type, and making it an alias to either int or longlong depending on the platform, would fix all of this on all current platforms. Cheers, Mark > > Ben Root > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Sun Jun 12 04:32:48 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 12 Jun 2011 10:32:48 +0200 Subject: [Numpy-discussion] code review for datetime arange In-Reply-To: References: <4DF13004.5090304@noaa.gov> Message-ID: On Fri, Jun 10, 2011 at 4:18 PM, Mark Wiebe wrote: > On Fri, Jun 10, 2011 at 12:56 AM, Ralf Gommers < > ralf.gommers at googlemail.com> wrote: > >> >> Maybe I'm misunderstanding this, and once you make a function work for >> datetime it would also work for other new dtypes. 
But my impression is that >> that's not the case. Let's say I make a new dtype with distance instead of >> time attached. Would I be able to use it with arange, or would I have to go >> in and change the arange implementation again to support it? >> > > Ok, I think I understand the point you're driving at now. We need NumPy to > have the flexibility so that external plugins defining custom data types can > do the same thing that datetime does, having datetime be special compared to > those is undesirable. This I wholeheartedly agree with, and the way I'm > coding datetime is driving in that direction. > > The state of the NumPy codebase, however, prevents jumping straight to such > a solution, since there are several mechanisms and layers of such > abstraction already which themselves do not satisfy the needs of datetime > and other similar types. Also, arange is just one of a large number of > functions which could be extended individually for different types, for > example if one wanted to make a unit quaternion data type for manipulating > rotations, its needs would be significantly different. Because I can't see > the big picture from within the world of datetime, and because such > generalization takes a great amount of effort, I'm instead making these > changes minimally invasive on the current codebase. > > This approach is along the lines of "lifting" in generic programming. First > you write your algorithm in the specific domain, and work out how it behaves > there. Then, you determine what are the minimal requirements of the types > involved, and abstract the algorithm appropriately. Jumping to the second > step directly is generally too difficult, an incremental path to the final > goal must be used. > > http://www.generic-programming.org/about/intro/lifting.php > > Thanks for explaining, this makes sense to me now. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Sun Jun 12 04:48:49 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 12 Jun 2011 10:48:49 +0200 Subject: [Numpy-discussion] numpy build: automatic fortran detection In-Reply-To: References: Message-ID: On Thu, Jun 9, 2011 at 11:46 PM, Russell E. Owen wrote: > What would it take to automatically detect which flavor of fortran to > use to build numpy on linux? > You want to figure out which compiler was used to build BLAS/LAPACK/ATLAS and check that the numpy build uses the same, right? Assuming you only care about g77/gfortran, you can try this (from http://docs.scipy.org/doc/numpy/user/install.html): """ How to check the ABI of blas/lapack/atlas ----------------------------------------- One relatively simple and reliable way to check for the compiler used to build a library is to use ldd on the library. If libg2c.so is a dependency, this means that g77 has been used. If libgfortran.so is a a dependency, gfortran has been used. If both are dependencies, this means both have been used, which is almost always a very bad idea. """ You could do something similar for other compilers if needed. It would help to know exactly what problem you are trying to solve. > The unit tests are clever enough to detect a mis-build (though > surprisingly that is not done as part of the build process), so surely > it can be done. > The test suite takes some time to run. It would be very annoying if it ran by default on every rebuild. It's easy to write a build script that builds numpy, then runs the tests, if that's what you need. 
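A rough sketch of the ldd check described above (the library path in the example call is only a placeholder, and the parsing simply looks for the two runtime names mentioned in the quoted docs):

import subprocess

def fortran_abi(lib_path):
    """Guess which Fortran runtime a shared library was linked against."""
    out = subprocess.check_output(["ldd", lib_path]).decode()
    uses_g77 = "libg2c" in out
    uses_gfortran = "libgfortran" in out
    if uses_g77 and uses_gfortran:
        return "both g77 and gfortran (almost always a very bad idea)"
    if uses_g77:
        return "g77"
    if uses_gfortran:
        return "gfortran"
    return "unknown"

# placeholder path; point it at wherever your BLAS/LAPACK/ATLAS lives
print(fortran_abi("/usr/lib/atlas/liblapack.so"))
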
Cheers, Ralf > Even if there is no interest in putting this into numpy's setup.py, we > have a need for it for our own system. Any suggestions on how to do this > robustly would be much appreciated. > > -- Russell > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chaoyuejoy at gmail.com Sun Jun 12 08:10:41 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Sun, 12 Jun 2011 14:10:41 +0200 Subject: [Numpy-discussion] ndarray display in ipython Message-ID: Dear all pythoners, Does anybody know how I can choose the default display style for the data? like in the following case, the data are display in a scientific notation if they are ver small. another question is, the display fold the data in first row into several rows if there is still space on my computer screen, how can I adjust it to a more viewer friendly way? In [97]: npdata array([[ 2.00300000e+03, 4.44500000e+01, 3.56600000e+02, 8.73500000e+02, 3.12200000e+02, 5.17000000e+02, 1.37100000e+03, 4.54600000e+02, 9.68400000e+03, 1.89500000e+00, 3.97300000e+03, 2.04500000e+02, 3.10200000e+02, 1.19940000e+04, 2.76800000e+03, 1.14100000e+03, 1.00000000e+00], [ 2.00400000e+03, 2.63600000e+01, 3.11400000e+02, 7.65100000e+02, 2.85100000e+02, 4.53700000e+02, 1.34700000e+03, 4.36300000e+02, 9.65400000e+03, 1.88100000e+00, 4.07700000e+03, 2.03000000e+02, 3.06000000e+02, 1.19180000e+04, 2.87000000e+03, 1.17000000e+03, 1.00000000e+00], [ 2.00500000e+03, 3.55900000e+01, 3.44500000e+02, 8.43900000e+02, 3.08900000e+02, 4.99400000e+02, 1.32700000e+03, 4.22500000e+02, 9.62600000e+03, 1.89400000e+00, 4.18300000e+03, 2.04400000e+02, 3.04600000e+02, 1.18460000e+04, 2.94800000e+03, 1.18200000e+03, 1.00000000e+00]]) thank you very much. -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.raybaut at gmail.com Sun Jun 12 08:19:35 2011 From: pierre.raybaut at gmail.com (Pierre Raybaut) Date: Sun, 12 Jun 2011 14:19:35 +0200 Subject: [Numpy-discussion] ANN: Spyder v2.0.12 Message-ID: Hi all, I am pleased to announced that Spyder v2.0.12 has just been released (changelog available here: http://code.google.com/p/spyderlib/wiki/ChangeLog). This is the last maintenance release of version 2.0, until the forthcoming v2.1 release which is scheduled for the end of the month (see the roadmap here: http://code.google.com/p/spyderlib/wiki/Roadmap). Spyder (previously known as Pydee) is a free open-source Python development environment providing MATLAB-like features in a simple and light-weighted software, available for Windows XP/Vista/7, GNU/Linux and MacOS X: http://spyderlib.googlecode.com/. Spyder is also a library (spyderlib) providing *pure-Python* (PyQt/PySide) editors widgets: * source code editor: * efficient syntax highlighting (Python, C/C++, Fortran, html/css, gettext, ...) * code completion and calltips (powered by `rope`) * real-time code analysis (powered by `pyflakes`) * etc. (occurrence highlighting, ...) * NumPy array editor * Dictionnary editor * and many more widgets ("find in files", "text import wizard", ...) 
For those interested by these powerful widgets, note that: * spyderlib v2.0 and v2.1 are compatible with PyQt >= v4.4 (API #1) * spyderlib v2.2 is compatible with both PyQt >= v4.6 (API #2) and PySide (already available through the main source repository) (Spyder IDE itself -even v2.2- is not fully compatible with PySide -- only the "light" version is currently working) Cheers, Pierre From lev at columbia.edu Sun Jun 12 09:34:35 2011 From: lev at columbia.edu (Lev Givon) Date: Sun, 12 Jun 2011 09:34:35 -0400 Subject: [Numpy-discussion] ndarray display in ipython In-Reply-To: References: Message-ID: <20110612133434.GO19483@avicenna.ee.columbia.edu> Received from Chao YUE on Sun, Jun 12, 2011 at 08:10:41AM EDT: > Dear all pythoners, > > Does anybody know how I can choose the default display style for the data? > like in the following case, the data are display in a scientific notation if > they are ver small. > another question is, the display fold the data in first row into several > rows if there is still space on my computer screen, how can I adjust it to a > more viewer friendly way? > > In [97]: npdata > > array([[ 2.00300000e+03, 4.44500000e+01, 3.56600000e+02, > 8.73500000e+02, 3.12200000e+02, 5.17000000e+02, > 1.37100000e+03, 4.54600000e+02, 9.68400000e+03, > 1.89500000e+00, 3.97300000e+03, 2.04500000e+02, > 3.10200000e+02, 1.19940000e+04, 2.76800000e+03, > 1.14100000e+03, 1.00000000e+00], > [ 2.00400000e+03, 2.63600000e+01, 3.11400000e+02, > 7.65100000e+02, 2.85100000e+02, 4.53700000e+02, > 1.34700000e+03, 4.36300000e+02, 9.65400000e+03, > 1.88100000e+00, 4.07700000e+03, 2.03000000e+02, > 3.06000000e+02, 1.19180000e+04, 2.87000000e+03, > 1.17000000e+03, 1.00000000e+00], > [ 2.00500000e+03, 3.55900000e+01, 3.44500000e+02, > 8.43900000e+02, 3.08900000e+02, 4.99400000e+02, > 1.32700000e+03, 4.22500000e+02, 9.62600000e+03, > 1.89400000e+00, 4.18300000e+03, 2.04400000e+02, > 3.04600000e+02, 1.18460000e+04, 2.94800000e+03, > 1.18200000e+03, 1.00000000e+00]]) > > thank you very much. > Take a look at the numpy.set_printoptions function. L.G. From gruben at bigpond.net.au Sun Jun 12 09:38:57 2011 From: gruben at bigpond.net.au (gary ruben) Date: Sun, 12 Jun 2011 23:38:57 +1000 Subject: [Numpy-discussion] ndarray display in ipython In-Reply-To: References: Message-ID: You control this with numpy.set_printoptions: http://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html On Sun, Jun 12, 2011 at 10:10 PM, Chao YUE wrote: > Dear all pythoners, > > Does anybody know how I can choose the default display style for the data? > like in the following case, the data are display in a scientific notation if > they are ver small. > another question is, the display fold the data in?first row into several > rows if there is still space on my computer screen, how can I adjust it to a > more viewer friendly way? > > In [97]: npdata > > array([[? 2.00300000e+03,?? 4.44500000e+01,?? 3.56600000e+02, > ????????? 8.73500000e+02,?? 3.12200000e+02,?? 5.17000000e+02, > ????????? 1.37100000e+03,?? 4.54600000e+02,?? 9.68400000e+03, > ????????? 1.89500000e+00,?? 3.97300000e+03,?? 2.04500000e+02, > ????????? 3.10200000e+02,?? 1.19940000e+04,?? 2.76800000e+03, > ????????? 1.14100000e+03,?? 1.00000000e+00], > ?????? [? 2.00400000e+03,?? 2.63600000e+01,?? 3.11400000e+02, > ????????? 7.65100000e+02,?? 2.85100000e+02,?? 4.53700000e+02, > ????????? 1.34700000e+03,?? 4.36300000e+02,?? 9.65400000e+03, > ????????? 1.88100000e+00,?? 4.07700000e+03,?? 2.03000000e+02, > ????????? 3.06000000e+02,?? 
1.19180000e+04,?? 2.87000000e+03, > ????????? 1.17000000e+03,?? 1.00000000e+00], > ?????? [? 2.00500000e+03,?? 3.55900000e+01,?? 3.44500000e+02, > ????????? 8.43900000e+02,?? 3.08900000e+02,?? 4.99400000e+02, > ????????? 1.32700000e+03,?? 4.22500000e+02,?? 9.62600000e+03, > ????????? 1.89400000e+00,?? 4.18300000e+03,?? 2.04400000e+02, > ????????? 3.04600000e+02,?? 1.18460000e+04,?? 2.94800000e+03, > ????????? 1.18200000e+03,?? 1.00000000e+00]]) > > thank you very much. > > -- > *********************************************************************************** > Chao YUE > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > UMR 1572 CEA-CNRS-UVSQ > Batiment 712 - Pe 119 > 91191 GIF Sur YVETTE Cedex > Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 > ************************************************************************************ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From ralf.gommers at googlemail.com Sun Jun 12 12:30:15 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 12 Jun 2011 18:30:15 +0200 Subject: [Numpy-discussion] time for numpy 1.6.1? Message-ID: We've accumulated a reasonable amount of bug fixes in the 1.6.x branch, including some for functionality new in 1.6.0 (iterator, f2py) and a distutils regression. So I think it makes sense to do a bugfix release. Is there anything urgent that should still go in? Otherwise I can tag an RC tomorrow. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sun Jun 12 12:51:43 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 12 Jun 2011 10:51:43 -0600 Subject: [Numpy-discussion] time for numpy 1.6.1? In-Reply-To: References: Message-ID: On Sun, Jun 12, 2011 at 10:30 AM, Ralf Gommers wrote: > We've accumulated a reasonable amount of bug fixes in the 1.6.x branch, > including some for functionality new in 1.6.0 (iterator, f2py) and a > distutils regression. So I think it makes sense to do a bugfix release. Is > there anything urgent that should still go in? Otherwise I can tag an RC > tomorrow. > Can documentation additions still go in after the RC? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Sun Jun 12 12:56:59 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 12 Jun 2011 18:56:59 +0200 Subject: [Numpy-discussion] time for numpy 1.6.1? In-Reply-To: References: Message-ID: On Sun, Jun 12, 2011 at 6:51 PM, Charles R Harris wrote: > > > On Sun, Jun 12, 2011 at 10:30 AM, Ralf Gommers < > ralf.gommers at googlemail.com> wrote: > >> We've accumulated a reasonable amount of bug fixes in the 1.6.x branch, >> including some for functionality new in 1.6.0 (iterator, f2py) and a >> distutils regression. So I think it makes sense to do a bugfix release. Is >> there anything urgent that should still go in? Otherwise I can tag an RC >> tomorrow. >> > > Can documentation additions still go in after the RC? > > That's fine with me, if the changes are not too extensive. Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
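Following up the numpy.set_printoptions pointer in the ndarray-display thread above, a minimal sketch using a shortened row of the posted data (the option values are illustrative choices, not anything recommended in the thread):

import numpy as np

# suppress=True prints the values in fixed notation instead of 1.19940000e+04,
# precision trims the digits, and linewidth controls where a row gets folded.
np.set_printoptions(suppress=True, precision=2, linewidth=120)

npdata = np.array([[2003.0, 44.45, 356.6, 873.5, 1.895, 11994.0]])
print(npdata)   # now shown as plain decimals on one line
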
URL: From lgautier at gmail.com Sun Jun 12 15:00:30 2011 From: lgautier at gmail.com (Laurent Gautier) Date: Sun, 12 Jun 2011 21:00:30 +0200 Subject: [Numpy-discussion] Install error for numpy 1.6 Message-ID: <4DF50CCE.1030506@gmail.com> Hi, I did not find the following problem reported. When trying to install Numpy 1.6 with Python 2.7.1+ (r271:86832), gcc 4.5.2, and pip 1.0.1 (through a virtualenv 1.4.2 Python) it fails fails with: File "/usr/lib/python2.7/distutils/command/config.py", line 103, in _check_compiler customize_compiler(self.compiler) File "/usr/lib/python2.7/distutils/ccompiler.py", line 44, in customize_compiler cpp = cc + " -E" # not always TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' Complete output from command python setup.py egg_info: Running from numpy source directory.non-existing path in 'numpy/distutils': 'site.cfg' When crawling up the trace of errors, the jump between numpy's files and distutils appears to be at: File "numpy/core/setup.py", line 694, in get_mathlib_info st = config_cmd.try_link('int main(void) { return 0;}') File "/usr/lib/python2.7/distutils/command/config.py", line 248, in try_link self._check_compiler() Is this know ? Is there a workaround ? Thanks, Laurent From ralf.gommers at googlemail.com Sun Jun 12 15:08:09 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 12 Jun 2011 21:08:09 +0200 Subject: [Numpy-discussion] Install error for numpy 1.6 In-Reply-To: <4DF50CCE.1030506@gmail.com> References: <4DF50CCE.1030506@gmail.com> Message-ID: On Sun, Jun 12, 2011 at 9:00 PM, Laurent Gautier wrote: > Hi, > > I did not find the following problem reported. > > When trying to install Numpy 1.6 with Python 2.7.1+ (r271:86832), gcc > 4.5.2, and pip 1.0.1 (through a virtualenv 1.4.2 Python) it fails fails > with: > > File "/usr/lib/python2.7/distutils/command/config.py", line 103, > in _check_compiler > customize_compiler(self.compiler) > File "/usr/lib/python2.7/distutils/ccompiler.py", line 44, in > customize_compiler > cpp = cc + " -E" # not always > TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' > Complete output from command python setup.py egg_info: > Running from numpy source directory.non-existing path in > 'numpy/distutils': 'site.cfg' > > When crawling up the trace of errors, the jump between numpy's files and > distutils appears to be at: > > File "numpy/core/setup.py", line 694, in get_mathlib_info > st = config_cmd.try_link('int main(void) { return 0;}') > File "/usr/lib/python2.7/distutils/command/config.py", line 248, in > try_link > self._check_compiler() > > > Is this know ? Is there a workaround ? > > Looks like a new one. Does a normal "python setup.py" in your activated virtualenv work? If not, something is wrong in your environment. If it does work, you have your workaround. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From chaoyuejoy at gmail.com Sun Jun 12 16:18:47 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Sun, 12 Jun 2011 22:18:47 +0200 Subject: [Numpy-discussion] ndarray display in ipython In-Reply-To: References: Message-ID: Hi, Thank you. That's it! Chao 2011/6/12 gary ruben > You control this with numpy.set_printoptions: > > > http://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html > > On Sun, Jun 12, 2011 at 10:10 PM, Chao YUE wrote: > > Dear all pythoners, > > > > Does anybody know how I can choose the default display style for the > data? 
> > like in the following case, the data are display in a scientific notation > if > > they are ver small. > > another question is, the display fold the data in first row into several > > rows if there is still space on my computer screen, how can I adjust it > to a > > more viewer friendly way? > > > > In [97]: npdata > > > > array([[ 2.00300000e+03, 4.44500000e+01, 3.56600000e+02, > > 8.73500000e+02, 3.12200000e+02, 5.17000000e+02, > > 1.37100000e+03, 4.54600000e+02, 9.68400000e+03, > > 1.89500000e+00, 3.97300000e+03, 2.04500000e+02, > > 3.10200000e+02, 1.19940000e+04, 2.76800000e+03, > > 1.14100000e+03, 1.00000000e+00], > > [ 2.00400000e+03, 2.63600000e+01, 3.11400000e+02, > > 7.65100000e+02, 2.85100000e+02, 4.53700000e+02, > > 1.34700000e+03, 4.36300000e+02, 9.65400000e+03, > > 1.88100000e+00, 4.07700000e+03, 2.03000000e+02, > > 3.06000000e+02, 1.19180000e+04, 2.87000000e+03, > > 1.17000000e+03, 1.00000000e+00], > > [ 2.00500000e+03, 3.55900000e+01, 3.44500000e+02, > > 8.43900000e+02, 3.08900000e+02, 4.99400000e+02, > > 1.32700000e+03, 4.22500000e+02, 9.62600000e+03, > > 1.89400000e+00, 4.18300000e+03, 2.04400000e+02, > > 3.04600000e+02, 1.18460000e+04, 2.94800000e+03, > > 1.18200000e+03, 1.00000000e+00]]) > > > > thank you very much. > > > > -- > > > *********************************************************************************** > > Chao YUE > > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > > UMR 1572 CEA-CNRS-UVSQ > > Batiment 712 - Pe 119 > > 91191 GIF Sur YVETTE Cedex > > Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 > > > ************************************************************************************ > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sun Jun 12 16:23:26 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sun, 12 Jun 2011 15:23:26 -0500 Subject: [Numpy-discussion] time for numpy 1.6.1? In-Reply-To: References: Message-ID: On Sun, Jun 12, 2011 at 11:30 AM, Ralf Gommers wrote: > We've accumulated a reasonable amount of bug fixes in the 1.6.x branch, > including some for functionality new in 1.6.0 (iterator, f2py) and a > distutils regression. So I think it makes sense to do a bugfix release. Is > there anything urgent that should still go in? Otherwise I can tag an RC > tomorrow. > Sounds great to me. -Mark > Cheers, > Ralf > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From butterw at gmail.com Sun Jun 12 16:46:48 2011 From: butterw at gmail.com (Peter Butterworth) Date: Sun, 12 Jun 2011 22:46:48 +0200 Subject: [Numpy-discussion] np.histogram: upper range bin In-Reply-To: References: Message-ID: Consistent bin width is important for my applications. With floating point numbers I usually shift my bins by a small offset to ensure values at bin edges always fall in the correct bin. With the current np.histogram behavior you _silently_ get a wrong count in the top bin if a value falls on the upper bin limit. Incidentally this happens by default with integers. ex: x=range(4); np.histogram(x) I guess it is better to always specify the correct range, but wouldn't it be preferable if the function provided a warning when this case occurs ? Likely I will test for the following condition when using np.histogram(x): max(x) == top bin limit --- Re: [Numpy-discussion] np.histogram: upper range bin Christopher Barker Thu, 02 Jun 2011 09:19:16 -0700 Peter Butterworth wrote: > in np.histogram the top-most bin edge is inclusive of the upper range > limit. As documented in the docstring (see below) this is actually the > expected behavior, but this can lead to some weird enough results: > > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) > Out[72]: (array([1, 1, 2]), array([ 1., 2., 3., 4.])) > > Is there any way round this or an alternative implementation without > this issue ? The way around it is what you've identified -- making sure your bins are right. But I think the current behavior is the way it "should" be. It keeps folks from inadvertently loosing stuff off the end -- the lower end is inclusive, so the upper end should be too. In the middle bins, one has to make an arbitrary cut-off, and put the values on the "line" somewhere. One thing to keep in mind is that, in general, histogram is designed for floating point numbers, not just integers -- counting integers can be accomplished other ways, if that's what you really want (see np.bincount). But back to your example: > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) Why do you want only 3 bins here? using 4 gives you what you want. If you want more control, then it seems you really want to know how many of each of the values 1,2,3,4 there are. so you want 4 bins, each *centered* on the integers, so you might do: In [8]: np.histogram(x, bins=4, range=(0.5, 4.5)) Out[8]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) or, if you want to be more explicit: In [14]: np.histogram(x, bins=np.linspace(0.5, 4.5, 5)) Out[14]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception From scott.sinclair.za at gmail.com Mon Jun 13 07:08:19 2011 From: scott.sinclair.za at gmail.com (Scott Sinclair) Date: Mon, 13 Jun 2011 13:08:19 +0200 Subject: [Numpy-discussion] Install error for numpy 1.6 In-Reply-To: References: <4DF50CCE.1030506@gmail.com> Message-ID: On 12 June 2011 21:08, Ralf Gommers wrote: > > On Sun, Jun 12, 2011 at 9:00 PM, Laurent Gautier wrote: >> >> I did not find the following problem reported. >> >> When trying to install Numpy 1.6 with Python 2.7.1+ (r271:86832), gcc >> 4.5.2, and pip 1.0.1 (through a virtualenv 1.4.2 Python) it fails fails >> with: >> >> ? ? ? File "/usr/lib/python2.7/distutils/command/config.py", line 103, >> in _check_compiler >> ? ? ? ? 
customize_compiler(self.compiler) >> ? ? ? File "/usr/lib/python2.7/distutils/ccompiler.py", line 44, in >> customize_compiler >> ? ? ? ? cpp = cc + " -E" ? ? ? ? ? # not always >> ? ? TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' >> ? ? Complete output from command python setup.py egg_info: >> ? ? Running from numpy source directory.non-existing path in >> 'numpy/distutils': 'site.cfg' >> > Looks like a new one. Does a normal "python setup.py" in your activated > virtualenv work? If not, something is wrong in your environment. If it does > work, you have your workaround. See http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056544.html for a workaround (at least to get "python setup.py install" working, I haven't tried with pip/easy_install etc..) Cheers, Scott From ralf.gommers at googlemail.com Mon Jun 13 08:58:26 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 13 Jun 2011 14:58:26 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 Message-ID: Hi, I am pleased to announce the availability of the first release candidate of NumPy 1.6.1. This is a bugfix release, list of fixed bugs: #1834 einsum fails for specific shapes #1837 einsum throws nan or freezes python for specific array shapes #1838 object <-> structured type arrays regression #1851 regression for SWIG based code in 1.6.0 #1863 Buggy results when operating on array copied with astype() If no problems are reported, the final release will be in one week. Sources and binaries can be found at https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ Enjoy, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Mon Jun 13 09:40:25 2011 From: shish at keba.be (Olivier Delalleau) Date: Mon, 13 Jun 2011 09:40:25 -0400 Subject: [Numpy-discussion] numpy type mismatch In-Reply-To: References: Message-ID: 2011/6/10 Olivier Delalleau > 2011/6/10 Charles R Harris > >> >> >> On Fri, Jun 10, 2011 at 3:43 PM, Benjamin Root wrote: >> >>> >>> >>> On Fri, Jun 10, 2011 at 3:24 PM, Charles R Harris < >>> charlesr.harris at gmail.com> wrote: >>> >>>> >>>> >>>> On Fri, Jun 10, 2011 at 2:17 PM, Benjamin Root wrote: >>>> >>>>> >>>>> >>>>> On Fri, Jun 10, 2011 at 3:02 PM, Charles R Harris < >>>>> charlesr.harris at gmail.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 10, 2011 at 1:50 PM, Benjamin Root wrote: >>>>>> >>>>>>> Came across an odd error while using numpy master. Note, my system >>>>>>> is 32-bits. >>>>>>> >>>>>>> >>> import numpy as np >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) == np.int32 >>>>>>> False >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int64)) == np.int64 >>>>>>> True >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float32)) == np.float32 >>>>>>> True >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.float64)) == np.float64 >>>>>>> True >>>>>>> >>>>>>> So, only the summation performed with a np.int32 accumulator results >>>>>>> in a type that doesn't match the expected type. Now, for even more >>>>>>> strangeness: >>>>>>> >>>>>>> >>> type(np.sum([1, 2, 3], dtype=np.int32)) >>>>>>> >>>>>>> >>> hex(id(type(np.sum([1, 2, 3], dtype=np.int32)))) >>>>>>> '0x9599a0' >>>>>>> >>> hex(id(np.int32)) >>>>>>> '0x959a80' >>>>>>> >>>>>>> So, the type from the sum() reports itself as a numpy int, but its >>>>>>> memory address is different from the memory address for np.int32. >>>>>>> >>>>>>> >>>>>> One of them is probably a long, print out the typecode, dtype.char. 
>>>>>> >>>>>> Chuck >>>>>> >>>>>> >>>>>> >>>>> Good intuition, but odd result... >>>>> >>>>> >>> import numpy as np >>>>> >>> a = np.sum([1, 2, 3], dtype=np.int32) >>>>> >>> b = np.int32(6) >>>>> >>> type(a) >>>>> >>>>> >>> type(b) >>>>> >>>>> >>> a.dtype.char >>>>> 'i' >>>>> >>> b.dtype.char >>>>> 'l' >>>>> >>>>> So, the standard np.int32 is getting listed as a long somehow? To >>>>> further investigate: >>>>> >>>>> >>>> Yes, long shifts around from int32 to int64 depending on the OS. For >>>> instance, in 64 bit Windows it's 32 bits while in 64 bit Linux it's 64 bits. >>>> On 32 bit systems it is 32 bits. >>>> >>>> Chuck >>>> >>>> >>> Right, that makes sense. But, the question is why does sum() put out a >>> result dtype that is not identical to the dtype that I requested, or even >>> the dtype of the input array? Could this be an indication of a bug >>> somewhere? Even if the bug is harmless (it was only noticed within the test >>> suite of larry), is this unexpected? >>> >>> >> I expect sum is using a ufunc and it acts differently on account of the >> cleanup of the ufunc casting rules. And yes, a long *is* int32 on your >> machine. On mine >> >> In [4]: dtype('q') # long long >> Out[4]: dtype('int64') >> >> In [5]: dtype('l') # long >> Out[5]: dtype('int64') >> >> The mapping from C types to numpy width types isn't 1-1. Personally, I >> think we should drop long ;) But it used to be the standard Python type in >> the C API. Mark has also pointed out the problems/confusion this ambiguity >> causes and someday we should probably think it out and fix it. But I don't >> think it is the most pressing problem. >> >> Chuck >> >> > But isn't it a bug if numpy.dtype('i') != numpy.dtype('l') on a 32 bit > computer where both are int32? > > -=- Olivier > So, I tried with the latest 1.6.1rc1. It turns out I misinterpreted the OP. Although the first test indeed returns False now, I hadn't taken into account it was a comparison of types, not dtypes. If you replace it with (np.sum([1, 2, 3], dtype=np.int32)).dtype == np.dtype(np.int32) then it returns True. So we still have numpy.dtype('i') == numpy.dtype('l'), and I'm not concerned anymore :) Sorry for the false alarm... -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Mon Jun 13 10:17:10 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Mon, 13 Jun 2011 16:17:10 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: Message-ID: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> Hi Ralf, > I am pleased to announce the availability of the first release candidate of NumPy 1.6.1. This is a bugfix release, list of fixed bugs: > #1834 einsum fails for specific shapes > #1837 einsum throws nan or freezes python for specific array shapes > #1838 object <-> structured type arrays regression > #1851 regression for SWIG based code in 1.6.0 > #1863 Buggy results when operating on array copied with astype() there are a bunch of test failures under Python3, mostly with new datetime tests trivially fixed by 'str' -> asbytes('str') conversions (I can send a patch or pull request for that), and two more I can't provide a fix for yet: FAIL: Test custom format function for each element in array. 
---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_arrayprint.py", line 86, in test_format_function "[0x0L 0x1L 0x2L]") File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: '[0x0 0x1 0x2]' DESIRED: '[0x0L 0x1L 0x2L]' - this is x = np.arange(3) assert_(np.array2string(x, formatter={'all':_format_function}) == \ "[. o O]") assert_(np.array2string(x, formatter={'int_kind':_format_function}) ==\ "[. o O]") assert_(np.array2string(x, formatter={'all':lambda x: "%.4f" % x}) == \ "[0.0000 1.0000 2.0000]") assert_equal(np.array2string(x, formatter={'int':lambda x: hex(x)}), \ "[0x0L 0x1L 0x2L]") (btw. these were all assert_(a == b) before, which do not give a useful error message by default - any reason not to change them to assert_equal(, b)?) and FAIL: Ticket #1748 ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_regression.py", line 1549, in test_string_astype assert_equal(b.dtype, np.dtype('S5')) File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: dtype(' References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> Message-ID: Hi Derek, On Mon, Jun 13, 2011 at 4:17 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > Hi Ralf, > > > I am pleased to announce the availability of the first release candidate > of NumPy 1.6.1. This is a bugfix release, list of fixed bugs: > > #1834 einsum fails for specific shapes > > #1837 einsum throws nan or freezes python for specific array shapes > > #1838 object <-> structured type arrays regression > > #1851 regression for SWIG based code in 1.6.0 > > #1863 Buggy results when operating on array copied with astype() > > there are a bunch of test failures under Python3, mostly with new datetime > tests > trivially fixed by 'str' -> asbytes('str') conversions (I can send a patch > or pull request > for that), and two more I can't provide a fix for yet: Thanks for testing. > > FAIL: Test custom format function for each element in array. > ---------------------------------------------------------------------- > This test is not in 1.6.x, only in master. I suspect the same is true for the datetime tests, but perhaps not for the S5/U5 thing. Can you clean your install dir and try again? Thanks, Ralf > Traceback (most recent call last): > File > "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_arrayprint.py", > line 86, in test_format_function > "[0x0L 0x1L 0x2L]") > File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", > line 313, in assert_equal > raise AssertionError(msg) > AssertionError: > Items are not equal: > ACTUAL: '[0x0 0x1 0x2]' > DESIRED: '[0x0L 0x1L 0x2L]' > > - this is > x = np.arange(3) > assert_(np.array2string(x, formatter={'all':_format_function}) == \ > "[. o O]") > assert_(np.array2string(x, formatter={'int_kind':_format_function}) > ==\ > "[. o O]") > assert_(np.array2string(x, formatter={'all':lambda x: "%.4f" % x}) > == \ > "[0.0000 1.0000 2.0000]") > assert_equal(np.array2string(x, formatter={'int':lambda x: hex(x)}), > \ > "[0x0L 0x1L 0x2L]") > > (btw. 
these were all assert_(a == b) before, which do not give a useful > error message > by default - any reason not to change them to assert_equal(, b)?) > > and > FAIL: Ticket #1748 > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_regression.py", > line 1549, in test_string_astype > assert_equal(b.dtype, np.dtype('S5')) > File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", > line 313, in assert_equal > raise AssertionError(msg) > AssertionError: > Items are not equal: > ACTUAL: dtype(' DESIRED: dtype('|S5') > > for > > s1 = asbytes('black') > s2 = asbytes('white') > s3 = asbytes('other') > a = np.array([[s1],[s2],[s3]]) > assert_equal(a.dtype, np.dtype('S5')) > b = a.astype('str') > assert_equal(b.dtype, np.dtype('S5')) > > I can get around this by changing the last lines to > > b = a.astype(np.dtype('S5')) > assert_equal(b.dtype, np.dtype('S5')) > > but am not sure if this preserves the purpose of the test... > > HTH, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Mon Jun 13 11:11:48 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Mon, 13 Jun 2011 17:11:48 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> Message-ID: <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> Hi Ralf, > > FAIL: Test custom format function for each element in array. > ---------------------------------------------------------------------- > > This test is not in 1.6.x, only in master. I suspect the same is true for the datetime tests, but perhaps not for the S5/U5 thing. Can you clean your install dir and try again? > you're right - I've tried to download the tarball, but am getting connection errors or incomplete downloads from all available SF mirrors, and apparently I was still too thick to figure out how to checkout a specific tag... After a fresh checkout of maintenance/1.6.x all tests succeed (on OS X 10.6 x86_64). I've put the test issues with master on https://github.com/dhomeier/numpy/tree/dt_tests_2to3 Cheers, Derek From Chris.Barker at noaa.gov Mon Jun 13 11:45:52 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Mon, 13 Jun 2011 08:45:52 -0700 Subject: [Numpy-discussion] np.histogram: upper range bin In-Reply-To: References: Message-ID: <4DF630B0.8030804@noaa.gov> Peter Butterworth wrote: > Consistent bin width is important for my applications. With floating > point numbers I usually shift my bins by a small offset to ensure > values at bin edges always fall in the correct bin. > With the current np.histogram behavior you _silently_ get a wrong > count in the top bin if a value falls on the upper bin limit. > Incidentally this happens by default with integers. ex: x=range(4); > np.histogram(x) Again, the trick here is that histogram really is for floating point numbers, not integers. > Likely I will test for the following condition when using > np.histogram(x): > max(x) == top bin limit from the docstring: """ range : (float, float), optional The lower and upper range of the bins. 
If not provided, range is simply ``(a.min(), a.max())``. """ So, in fact, if you don't specify the range, the top of the largest bin will ALWAYS be the max value in your data. You seem to be advocating that that value not be included in any bins -- which would surely not be a good option. > With the current np.histogram behavior you _silently_ get a wrong > count in the top bin if a value falls on the upper bin limit. It is not a wrong count -- it is absolutely correct. I think the issue here is that you are binning integers, which is really a categorical binning, not a value-based one. If you want categorical binning, there are other ways to do that. I suppose having np.histogram do something different if the input is integers might make sense, but that's probably not worth it. By the way, what would you have it do it you had integers, but a large number, over a large range of values: In [68]: x = numpy.random.randint(0, 100000000, size=(100,) ) In [70]: np.histogram(x) Out[70]: (array([10, 10, 12, 16, 12, 9, 11, 4, 9, 7]), array([ 712131. , 10437707.4 , 20163283.8 , 29888860.2 , 39614436.6 , 49340013. , 59065589.40000001, 68791165.8 , 78516742.2 , 88242318.60000001, 97967895. ])) > I guess it is better to always specify the correct range, but wouldn't > it be preferable if the function provided a > warning when this case occurs ? > > > --- > Re: [Numpy-discussion] np.histogram: upper range bin > Christopher Barker > Thu, 02 Jun 2011 09:19:16 -0700 > > Peter Butterworth wrote: >> in np.histogram the top-most bin edge is inclusive of the upper range >> limit. As documented in the docstring (see below) this is actually the >> expected behavior, but this can lead to some weird enough results: >> >> In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) >> Out[72]: (array([1, 1, 2]), array([ 1., 2., 3., 4.])) >> >> Is there any way round this or an alternative implementation without >> this issue ? > > The way around it is what you've identified -- making sure your bins are > right. But I think the current behavior is the way it "should" be. It > keeps folks from inadvertently loosing stuff off the end -- the lower > end is inclusive, so the upper end should be too. In the middle bins, > one has to make an arbitrary cut-off, and put the values on the "line" > somewhere. > > > One thing to keep in mind is that, in general, histogram is designed for > floating point numbers, not just integers -- counting integers can be > accomplished other ways, if that's what you really want (see > np.bincount). But back to your example: > > > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) > > Why do you want only 3 bins here? using 4 gives you what you want. If > you want more control, then it seems you really want to know how many of > each of the values 1,2,3,4 there are. so you want 4 bins, each > *centered* on the integers, so you might do: > > In [8]: np.histogram(x, bins=4, range=(0.5, 4.5)) > Out[8]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) > > or, if you want to be more explicit: > > In [14]: np.histogram(x, bins=np.linspace(0.5, 4.5, 5)) > Out[14]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) > > > HTH, > > -Chris > -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From bsouthey at gmail.com Mon Jun 13 12:08:18 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 13 Jun 2011 11:08:18 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> Message-ID: <4DF635F2.8030406@gmail.com> On 06/13/2011 10:11 AM, Derek Homeier wrote: > Hi Ralf, > >> FAIL: Test custom format function for each element in array. >> ---------------------------------------------------------------------- >> >> This test is not in 1.6.x, only in master. I suspect the same is true for the datetime tests, but perhaps not for the S5/U5 thing. Can you clean your install dir and try again? >> > you're right - I've tried to download the tarball, but am getting connection errors or incomplete > downloads from all available SF mirrors, and apparently I was still too thick to figure out how > to checkout a specific tag... > After a fresh checkout of maintenance/1.6.x all tests succeed (on OS X 10.6 x86_64). > > I've put the test issues with master on > > https://github.com/dhomeier/numpy/tree/dt_tests_2to3 > > Cheers, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion I used wget using the direct link and it eventually got the complete file after multiple tries. Anyhow, the tests passed for me on Linux 64-bit for Python 2.4, 2.5, 2.7 and 3.1. My Python 2.6 is not a standard build so I am avoiding it. 
I do get an error with Python3.2 Bruce $ python3.2 -c "import numpy; numpy.test()" Running unit tests for numpy NumPy version 1.6.1rc1 NumPy is installed in /usr/local/lib/python3.2/site-packages/numpy Python version 3.2 (r32:88445, Mar 23 2011, 11:36:08) [GCC 4.5.1 20100924 (Red Hat 4.5.1-4)] nose version 3.0.0 .................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K............................................................................K............................................................................................................K.................................................................................................K......................K.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/local/lib/python3.2/site-packages/numpy/lib/format.py:575: ResourceWarning: unclosed file <_io.BufferedReader name='/tmp/tmp66k88a'> mode=mode, offset=offset) ............................................................................................................................................................................................................................................................................................................................................................................................................................................................................/usr/local/lib/python3.2/subprocess.py:460: ResourceWarning: unclosed file <_io.BufferedReader name=3> return Popen(*popenargs, **kwargs).wait() /usr/local/lib/python3.2/subprocess.py:460: ResourceWarning: unclosed file <_io.BufferedReader name=8> return Popen(*popenargs, **kwargs).wait() 
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E........ ====================================================================== ERROR: Failure: OSError (/usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: cannot open shared object file: No such file or directory) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/failure.py", line 37, in runTest reraise(self.exc_class, self.exc_val, self.tb) File "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/_3.py", line 7, in reraise raise exc_class(exc_val).with_traceback(tb) File "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/loader.py", line 389, in loadTestsFromName addr.filename, addr.module) File "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/importer.py", line 39, in importFromPath return self.importFromDir(dir_path, fqname) File "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/importer.py", line 86, in importFromDir mod = load_module(part_fqname, fh, filename, desc) File "/usr/local/lib/python3.2/site-packages/numpy/tests/test_ctypeslib.py", line 8, in cdll = load_library('multiarray', np.core.multiarray.__file__) File "/usr/local/lib/python3.2/site-packages/numpy/ctypeslib.py", line 122, in load_library raise exc File "/usr/local/lib/python3.2/site-packages/numpy/ctypeslib.py", line 119, in load_library return ctypes.cdll[libpath] File "/usr/local/lib/python3.2/ctypes/__init__.py", line 415, in __getitem__ return getattr(self, name) File "/usr/local/lib/python3.2/ctypes/__init__.py", line 410, in __getattr__ dll = self._dlltype(name) File "/usr/local/lib/python3.2/ctypes/__init__.py", line 340, in __init__ self._handle = _dlopen(self._name, mode) OSError: /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: cannot open shared object file: No such file or directory ---------------------------------------------------------------------- Ran 3525 tests in 33.713s FAILED (KNOWNFAIL=5, errors=1) From ralf.gommers at googlemail.com Mon Jun 13 12:19:38 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 13 Jun 2011 18:19:38 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <4DF635F2.8030406@gmail.com> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: On Mon, Jun 13, 2011 at 6:08 PM, Bruce Southey wrote: > On 06/13/2011 10:11 AM, Derek Homeier wrote: > > I used wget using the direct 
link and it eventually got the complete > file after multiple tries. > Yes, SF is having a pretty bad day. > > Anyhow, the tests passed for me on Linux 64-bit for Python 2.4, 2.5, 2.7 > and 3.1. My Python 2.6 is not a standard build so I am avoiding it. I do > get an error with Python3.2 > > That's an odd error. Did that work for you with 1.6.0? Could you investigate a bit? Thanks, Ralf > Bruce > > $ python3.2 -c "import numpy; numpy.test()" > Running unit tests for numpy > NumPy version 1.6.1rc1 > NumPy is installed in /usr/local/lib/python3.2/site-packages/numpy > Python version 3.2 (r32:88445, Mar 23 2011, 11:36:08) [GCC 4.5.1 > 20100924 (Red Hat 4.5.1-4)] > nose version 3.0.0 > > .................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K............................................................................K............................................................................................................K.................................. > > ...............................................................K......................K...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... 
> > ............................................................................................................................................................................./usr/local/lib/python3.2/site-packages/numpy/lib/format.py:575: > ResourceWarning: unclosed file <_io.BufferedReader name='/tmp/tmp66k88a'> > mode=mode, offset=offset) > > ............................................................................................................................................................................................................................................................................................................................................................................................................................................................................/usr/local/lib/python3.2/subprocess.py:460: > ResourceWarning: unclosed file <_io.BufferedReader name=3> > return Popen(*popenargs, **kwargs).wait() > /usr/local/lib/python3.2/subprocess.py:460: ResourceWarning: unclosed > file <_io.BufferedReader name=8> > return Popen(*popenargs, **kwargs).wait() > > ........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E........ 
> ====================================================================== > ERROR: Failure: OSError > (/usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: > cannot open shared object file: No such file or directory) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/failure.py", > line 37, in runTest > reraise(self.exc_class, self.exc_val, self.tb) > File > > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/_3.py", > line 7, in reraise > raise exc_class(exc_val).with_traceback(tb) > File > > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/loader.py", > line 389, in loadTestsFromName > addr.filename, addr.module) > File > > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/importer.py", > line 39, in importFromPath > return self.importFromDir(dir_path, fqname) > File > > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/importer.py", > line 86, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/usr/local/lib/python3.2/site-packages/numpy/tests/test_ctypeslib.py", > line 8, in > cdll = load_library('multiarray', np.core.multiarray.__file__) > File "/usr/local/lib/python3.2/site-packages/numpy/ctypeslib.py", > line 122, in load_library > raise exc > File "/usr/local/lib/python3.2/site-packages/numpy/ctypeslib.py", > line 119, in load_library > return ctypes.cdll[libpath] > File "/usr/local/lib/python3.2/ctypes/__init__.py", line 415, in > __getitem__ > return getattr(self, name) > File "/usr/local/lib/python3.2/ctypes/__init__.py", line 410, in > __getattr__ > dll = self._dlltype(name) > File "/usr/local/lib/python3.2/ctypes/__init__.py", line 340, in __init__ > self._handle = _dlopen(self._name, mode) > OSError: > /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: cannot > open shared object file: No such file or directory > > ---------------------------------------------------------------------- > Ran 3525 tests in 33.713s > > FAILED (KNOWNFAIL=5, errors=1) > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alan at ajackson.org Mon Jun 13 12:59:30 2011 From: alan at ajackson.org (alan at ajackson.org) Date: Mon, 13 Jun 2011 11:59:30 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: References: Message-ID: <20110613115930.03749905@ajackson.org> I'm joining this late (I've been traveling), but it might be useful to look at the fairly new R module "lubridate". They have put quite some thought into simplifying date handling, and when I have used it I have generally been quite pleased. The documentation is quite readable. Just Google it and it will be at the top. >Hey all, > >So I'm doing a summer internship at Enthought, and the first thing they >asked me to look into is finishing the datetime type in numpy. It turns out >that the estimates of how complete the type was weren't accurate, and to >support what the NEP describes required generalizing the ufunc type >resolution system. I also found that the date/time parsing code (based on >mxDateTime) was not robust, producing something for almost any arbitrary >garbage input. 
I've replaced much of the broken code and implemented a lot >of the functionality, and thought this might be a good point to do a pull >request on what I've got and get feedback on the issues I've run into. > >* The existing datetime-related API is probably not useful, and in fact >those functions aren't used internally anymore. Is it reasonable to remove >the functions, or do we just deprecate them? > >* Leap seconds probably deserve a rigorous treatment, but having an internal >representation with leap-seconds overcomplicates otherwise very simple and >fast operations. Could we internally use a value matching TAI or GPS time? >Currently it's a UTC time in the present, but the 1970 epoch is then not the >UTC 1970 epoch, but 10s of seconds off, and this isn't properly specified. >What are people's opinions? The Python datetime.datetime doesn't support >leap seconds (seconds == 60 is disallowed). > >* Default conversion to string - should it be in UTC or with the local >timezone baked in? As UTC it may be confusing because 'now' will print as a >different time than people would expect. > >* Business days - The existing business idea doesn't seem very useful, >representing just the western M-F work week and not accounting for holidays. >I've come up with a design which might address these issues: Extend the >metadata for business days with a string identifier, like 'M8[B:USA]', then >have a global internal dictionary which maps 'USA' to a workweek mask and a >list of holidays. The call to prepare this dictionary for a particular >business day type might look like np.set_business_days('USA', [1, 1, 1, 1, >1, 0, 0], np.array([ list of US holidays ], dtype='M8[D]')). Internally, >business days would be stored the same as regular days, but with special >treatment where landing on a weekend or holiday gives you back a NaT. > >If you are interested in the business day functionality, please comment on >this design! > >* The dtype constructor accepted 'O#' for object types, something I think >was wrong. I've removed that, but allow # to be 4 or 8, producing a >deprecation warning if it occurs. > >* Would it make sense to offset the week-based datetime's epoch so it aligns >with ISO 8601's week format? Jan 1, 1970 is a thursday, but the YYYY-Www >date format uses weeks starting on monday. I think producing strings in this >format when the datetime has units of weeks would be a natural thing to do. > >* Should the NaT (not-a-time) value behave like floating-point NaN? i.e. NaT >== NaT return false, etc. Should operations generating NaT trigger an >'invalid' floating point exception in ufuncs? > >Cheers, >Mark -- ----------------------------------------------------------------------- | Alan K. Jackson | To see a World in a Grain of Sand | | alan at ajackson.org | And a Heaven in a Wild Flower, | | www.ajackson.org | Hold Infinity in the palm of your hand | | Houston, Texas | And Eternity in an hour. - Blake | ----------------------------------------------------------------------- From Chris.Barker at noaa.gov Mon Jun 13 13:16:30 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Mon, 13 Jun 2011 10:16:30 -0700 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: <4DF645EE.7020801@noaa.gov> Brandt Belson wrote: > Unfortunately I can't flatten the arrays. 
I'm writing a library where > the user supplies an inner product function for two generic objects, and > almost always the inner product function does large array > multiplications at some point. The library doesn't get to know about the > underlying arrays. Now I'm confused -- if the user is providing the inner product implementation, how can you optimize that? Or are you trying to provide said user with an optimized "large array multiplication" that he/she can use? If so, then I'd post your implementation here, and folks can suggest improvements. If it's regular old element-wise multiplication: a*b (where a and b are numpy arrays) then you are right, numpy isn't using any fancy multi-core aware optimized package, so you should be able to make a faster version. You might try numexpr also -- it's pretty cool, though may not help for a single operation. It might give you some ideas, though. http://www.scipy.org/SciPyPackages/NumExpr -Chris > Thanks, > Brandt > > > > Message: 2 > Date: Fri, 10 Jun 2011 09:23:10 -0400 > From: Olivier Delalleau > > Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) > with numpy array multiplication > To: Discussion of Numerical Python > > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > It may not work for you depending on your specific problem > constraints, but > if you could flatten the arrays, then it would be a dot, and you > could maybe > compute multiple such dot products by storing those flattened arrays > into a > matrix. > > -=- Olivier > > 2011/6/10 Brandt Belson > > > > Hi, > > Thanks for getting back to me. > > I'm doing element wise multiplication, basically innerProduct = > > numpy.sum(array1*array2) where array1 and array2 are, in general, > > multidimensional. I need to do many of these operations, and I'd > like to > > split up the tasks between the different cores. I'm not using > numpy.dot, if > > I'm not mistaken I don't think that would do what I need. > > Thanks again, > > Brandt > > > > > > Message: 1 > >> Date: Thu, 09 Jun 2011 13:11:40 -0700 > >> From: Christopher Barker > > >> Subject: Re: [Numpy-discussion] Using multiprocessing (shared > memory) > >> with numpy array multiplication > >> To: Discussion of Numerical Python > > >> Message-ID: <4DF128FC.8000807 at noaa.gov > > > >> Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >> > >> Not much time, here, but since you got no replies earlier: > >> > >> > >> > > I'm parallelizing some code I've written using the built in > >> > multiprocessing > >> > > module. In my application, I need to multiply many > large arrays > >> > together > >> > >> is the matrix multiplication, or element-wise? If matrix, then numpy > >> should be using LAPACK, which, depending on how its built, could be > >> using all your cores already. This is heavily dependent on your your > >> numpy (really the LAPACK it uses0 is built. > >> > >> > > and > >> > > sum the resulting product arrays (inner products). > >> > >> are you using numpy.dot() for that? If so, then the above applies to > >> that as well. > >> > >> I know I could look at your code to answer these questions, but I > >> thought this might help. > >> > >> -Chris > >> > >> > >> > >> > >> > >> -- > >> Christopher Barker, Ph.D. 
> >> Oceanographer > >> > >> Emergency Response Division > >> NOAA/NOS/OR&R (206) 526-6959 > voice > >> 7600 Sand Point Way NE (206) 526-6329 > fax > >> Seattle, WA 98115 (206) 526-6317 > main reception > >> > >> Chris.Barker at noaa.gov > >> > > > > ------------------------------------------------------------------------ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From srean.list at gmail.com Mon Jun 13 13:51:08 2011 From: srean.list at gmail.com (srean) Date: Mon, 13 Jun 2011 12:51:08 -0500 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: Looking at the code the arrays that you are multiplying seem fairly small (300, 200) and you have 50 of them. So it might the case that there is not enough computational work to compensate for the cost of forking new processes and communicating the results. Have you tried larger arrays and more of them ? If you are on an intel machine and you have MKL libraries around I would strongly recommend that you use the matrix multiplication routine if possible. MKL will do the parallelization for you. Well, any good BLAS implementation would do the same, you dont really need MKL. ATLAS and ACML would work too, just that MKL has been setup for us and it works well. To give an idea, given the amount of tuning and optimization that these libraries have undergone a numpy.sum would be slower that an multiplication with a vector of all ones. So in the interest of speed the longer you stay in the BLAS context the better. --srean On Fri, Jun 10, 2011 at 10:01 AM, Brandt Belson wrote: > Unfortunately I can't flatten the arrays. I'm writing a library where the > user supplies an inner product function for two generic objects, and almost > always the inner product function does large array multiplications at some > point. The library doesn't get to know about the underlying arrays. > Thanks, > Brandt From rowen at uw.edu Mon Jun 13 14:53:37 2011 From: rowen at uw.edu (Russell E. Owen) Date: Mon, 13 Jun 2011 11:53:37 -0700 Subject: [Numpy-discussion] numpy build: automatic fortran detection References: Message-ID: In article , Ralf Gommers wrote: > On Thu, Jun 9, 2011 at 11:46 PM, Russell E. Owen wrote: > > > What would it take to automatically detect which flavor of fortran to > > use to build numpy on linux? > > > > You want to figure out which compiler was used to build BLAS/LAPACK/ATLAS > and check that the numpy build uses the same, right? Assuming you only care > about g77/gfortran, you can try this (from > http://docs.scipy.org/doc/numpy/user/install.html): > """ > How to check the ABI of blas/lapack/atlas > ----------------------------------------- > One relatively simple and reliable way to check for the compiler used to > build a library is to use ldd on the library. If libg2c.so is a dependency, > this means that g77 has been used. If libgfortran.so is a a dependency, > gfortran has been used. If both are dependencies, this means both have been > used, which is almost always a very bad idea. > """ > > You could do something similar for other compilers if needed. It would help > to know exactly what problem you are trying to solve. 
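(For reference, the ldd check described above boils down to something along these lines -- a rough sketch, and the exact compiled extension to inspect depends on how numpy is laid out on the system:)

import subprocess
import numpy.linalg

# rough sketch: lapack_lite is the compiled extension linked against
# BLAS/LAPACK, so its shared-library dependencies show which fortran
# runtime (libg2c.so for g77, libgfortran.so for gfortran) was used
lib = numpy.linalg.lapack_lite.__file__
deps = subprocess.Popen(['ldd', lib], stdout=subprocess.PIPE).communicate()[0]
print(deps)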
I'm trying to automate the process of figuring out which fortran to use because we have a build system that should install numpy for users, without user interaction. We found this out the hard way by not specifying a compiler and ending up with a numpy that was mis-built (according to its helpful unit tests). However, it appears that g77 is very old, so I'm now wondering if it would make sense to switch to gfortran for the default? I think our own procedure will be to assume gfortran and complain to users if it's not right. (We can afford the time to run the test.) > > The unit tests are clever enough to detect a mis-build (though > > surprisingly that is not done as part of the build process), so surely > > it can be done. > > The test suite takes some time to run. It would be very annoying if it ran > by default on every rebuild. It's easy to write a build script that builds > numpy, then runs the tests, if that's what you need. In this case the only relevant test is the test for the correct fortran compiler. That is the only test I was proposing be performed. -= Russell

From lgautier at gmail.com Mon Jun 13 15:02:22 2011 From: lgautier at gmail.com (lgautier) Date: Mon, 13 Jun 2011 12:02:22 -0700 (PDT) Subject: [Numpy-discussion] Install error for numpy 1.6 In-Reply-To: References: <4DF50CCE.1030506@gmail.com> Message-ID: <90ecfe8e-2486-4199-9dfd-2f88374c785a@r20g2000yqd.googlegroups.com>

On Jun 13, 1:08 pm, Scott Sinclair wrote: > On 12 June 2011 21:08, Ralf Gommers wrote: > > On Sun, Jun 12, 2011 at 9:00 PM, Laurent Gautier wrote: > > >> I did not find the following problem reported. > > >> When trying to install Numpy 1.6 with Python 2.7.1+ (r271:86832), gcc > >> 4.5.2, and pip 1.0.1 (through a virtualenv 1.4.2 Python) it fails > >> with: > > >> File "/usr/lib/python2.7/distutils/command/config.py", line 103, > >> in _check_compiler > >> customize_compiler(self.compiler) > >> File "/usr/lib/python2.7/distutils/ccompiler.py", line 44, in > >> customize_compiler > >> cpp = cc + " -E" # not always > >> TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' > >> Complete output from command python setup.py egg_info: > >> Running from numpy source directory.non-existing path in > >> 'numpy/distutils': 'site.cfg' > > > Looks like a new one. Does a normal "python setup.py" in your activated > > virtualenv work? If not, something is wrong in your environment. If it does > > work, you have your workaround. > > See http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056544.html > for a workaround (at least to get "python setup.py install" working, I > haven't tried with pip/easy_install etc..) Thanks. I ended up compiling/installing a separate Python instance. L. > > Cheers, > Scott > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

From totonixsame at gmail.com Mon Jun 13 15:55:46 2011 From: totonixsame at gmail.com (Thiago Franco Moraes) Date: Mon, 13 Jun 2011 16:55:46 -0300 Subject: [Numpy-discussion] Adjacent matrix Message-ID:

Hi all, I'm reproducing an algorithm from a paper. This paper takes as input a binary volumetric matrix. In one step of this paper, an adjacent matrix is calculated from this binary volumetric matrix; this adjacent matrix is calculated as below: "Find all of the grid points in that lie adjacent to one or more grid points of opposite values."
This is the code used to calculate that matrix:

int x,y,z,x_,y_,z_, addr,addr2;
for(z = 1; z < nz-1; z++)
    for(y = 1; y < ny-1; y++)
        for(x = 1; x < nx-1; x++)
            for(z_=z-1; z_ <= z+1; z_++)
                for(y_=y-1; y_ <= y+1; y_++)
                    for(x_=x-1; x_ <= x+1; x_++) {
                        addr = z*nx*ny + y*nx + x;
                        addr2 = z_*nx*ny + y_*nx + x_;
                        if(im[addr] != im[addr2])
                            out[addr] = 1;
                    }

Where nx, ny and nz are the x, y, z dimensions of the "im" input binary matrix. I think this operation is called bwperim. Yes, I can reproduce that code in Python, but it will not have great performance. My question is: is there a way to do that using numpy or scipy tools? Thanks! PS: The indexing starts at 1 because these are Matlab arrays, however the code is written in C.

From gael.varoquaux at normalesup.org Mon Jun 13 16:03:43 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 13 Jun 2011 22:03:43 +0200 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: References: Message-ID: <20110613200343.GH7860@phare.normalesup.org> Hi, You can probably find some inspiration from https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/feature_extraction/image.py Gaël

From bsouthey at gmail.com Mon Jun 13 20:05:41 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 13 Jun 2011 19:05:41 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: <4DF6A5D5.8080801@gmail.com> On 06/13/2011 11:19 AM, Ralf Gommers wrote: > > > On Mon, Jun 13, 2011 at 6:08 PM, Bruce Southey > wrote: > > On 06/13/2011 10:11 AM, Derek Homeier wrote: > > I used wget using the direct link and it eventually got the complete > file after multiple tries. > > > Yes, SF is having a pretty bad day. > > > Anyhow, the tests passed for me on Linux 64-bit for Python 2.4, > 2.5, 2.7 > and 3.1. My Python 2.6 is not a standard build so I am avoiding > it. I do > get an error with Python3.2 > > That's an odd error. Did that work for you with 1.6.0? Could you > investigate a bit? > > Thanks, > Ralf > > > Bruce > > $ python3.2 -c "import numpy; numpy.test()" > Running unit tests for numpy > NumPy version 1.6.1rc1 > NumPy is installed in /usr/local/lib/python3.2/site-packages/numpy > Python version 3.2 (r32:88445, Mar 23 2011, 11:36:08) [GCC 4.5.1 > 20100924 (Red Hat 4.5.1-4)] > nose version 3.0.0 > ....................................................................................................................................................................................................................................................................................................................................................................................................................K...........................................................................K....................................................................................................K..................................
> ...............................................................K......................K...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... > ............................................................................................................................................................................./usr/local/lib/python3.2/site-packages/numpy/lib/format.py:575: > ResourceWarning: unclosed file <_io.BufferedReader > name='/tmp/tmp66k88a'> > mode=mode, offset=offset) > ............................................................................................................................................................................................................................................................................................................................................................................................................................................................................/usr/local/lib/python3.2/subprocess.py:460: > ResourceWarning: unclosed file <_io.BufferedReader name=3> > return Popen(*popenargs, **kwargs).wait() > /usr/local/lib/python3.2/subprocess.py:460: ResourceWarning: unclosed > file <_io.BufferedReader name=8> > return Popen(*popenargs, **kwargs).wait() > ........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E........ 
> ====================================================================== > ERROR: Failure: OSError > (/usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: > cannot open shared object file: No such file or directory) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/failure.py", > line 37, in runTest > reraise(self.exc_class, self.exc_val, self.tb) > File > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/_3.py", > line 7, in reraise > raise exc_class(exc_val).with_traceback(tb) > File > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/loader.py", > line 389, in loadTestsFromName > addr.filename, addr.module) > File > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/importer.py", > line 39, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/usr/local/lib/python3.2/site-packages/nose-3.0.0.dev-py3.2.egg/nose/importer.py", > line 86, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/usr/local/lib/python3.2/site-packages/numpy/tests/test_ctypeslib.py", > line 8, in > cdll = load_library('multiarray', np.core.multiarray.__file__) > File "/usr/local/lib/python3.2/site-packages/numpy/ctypeslib.py", > line 122, in load_library > raise exc > File "/usr/local/lib/python3.2/site-packages/numpy/ctypeslib.py", > line 119, in load_library > return ctypes.cdll[libpath] > File "/usr/local/lib/python3.2/ctypes/__init__.py", line 415, in > __getitem__ > return getattr(self, name) > File "/usr/local/lib/python3.2/ctypes/__init__.py", line 410, in > __getattr__ > dll = self._dlltype(name) > File "/usr/local/lib/python3.2/ctypes/__init__.py", line 340, in > __init__ > self._handle = _dlopen(self._name, mode) > OSError: > /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: > cannot > open shared object file: No such file or directory > > ---------------------------------------------------------------------- > Ran 3525 tests in 33.713s > > FAILED (KNOWNFAIL=5, errors=1) > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion Both numpy 1.6.0rc3 and numpy 1.6.0 failed as well as a recent developmental version. So I will dig further into it when I get time. Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Mon Jun 13 20:53:57 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Tue, 14 Jun 2011 02:53:57 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: <4A3EECE0-4BB5-4073-A08C-76BE3546082F@astro.physik.uni-goettingen.de> On 13.06.2011, at 6:19PM, Ralf Gommers wrote: > >> I used wget using the direct link and it eventually got the complete >> file after multiple tries. >> > Yes, SF is having a pretty bad day. > I eventually also got the tarball through the fink/debian mirror list; no failures on OS X 10.5 i386. 
There are two different failures with Python3.x on ppc, gfortran 4.2.4, that were already present in 1.6.0 and still are on master however: >>> numpy.test('full') Running unit tests for numpy NumPy version 1.6.1rc1 NumPy is installed in /sw/lib/python3.2/site-packages/numpy Python version 3.2 (r32:88445, Mar 1 2011, 18:28:16) [GCC 4.0.1 (Apple Inc. build 5493)] nose version 1.0.0 ... ====================================================================== FAIL: test_return_character.TestF77ReturnCharacter.test_all ---------------------------------------------------------------------- Traceback (most recent call last): File "/sw/lib/python3.2/site-packages/nose/case.py", line 188, in runTest self.test(*self.arg) File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", line 78, in test_all self.check_function(getattr(self.module, name)) File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", line 12, in check_function r = t(array('ab'));assert_( r==asbytes('a'),repr(r)) File "/sw/lib/python3.2/site-packages/numpy/testing/utils.py", line 34, in assert_ raise AssertionError(msg) AssertionError: b' ' ====================================================================== FAIL: test_return_character.TestF90ReturnCharacter.test_all ---------------------------------------------------------------------- Traceback (most recent call last): File "/sw/lib/python3.2/site-packages/nose/case.py", line 188, in runTest self.test(*self.arg) File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", line 136, in test_all self.check_function(getattr(self.module.f90_return_char, name)) File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", line 12, in check_function r = t(array('ab'));assert_( r==asbytes('a'),repr(r)) File "/sw/lib/python3.2/site-packages/numpy/testing/utils.py", line 34, in assert_ raise AssertionError(msg) AssertionError: b' ' (both errors are raised on the first function tested, t0). Cheers, Derek From pav at iki.fi Mon Jun 13 21:31:18 2011 From: pav at iki.fi (Pauli Virtanen) Date: Tue, 14 Jun 2011 01:31:18 +0000 (UTC) Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: On Mon, 13 Jun 2011 11:08:18 -0500, Bruce Southey wrote: [clip] > OSError: > /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: cannot > open shared object file: No such file or directory I think that's a result of Python 3.2 changing the extension module file naming scheme (IIRC to a versioned one). The '.pyd' ending and its Unix counterpart are IIRC hardcoded to the failing test, or some support code in Numpy. Clearly, they should instead ask Python how it names the extensions modules. This information may be available somewhere in the `sys` module. From bbelson at princeton.edu Mon Jun 13 23:18:27 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Mon, 13 Jun 2011 23:18:27 -0400 Subject: [Numpy-discussion] multiprocessing (shared memory) with numpy array multiplication Message-ID: Hi all, Thanks for your replies. > Brandt Belson wrote: > > Unfortunately I can't flatten the arrays. I'm writing a library where > > the user supplies an inner product function for two generic objects, and > > almost always the inner product function does large array > > multiplications at some point. 
The library doesn't get to know about the > > underlying arrays. > > Now I'm confused -- if the user is providing the inner product > implementation, how can you optimize that? Or are you trying to provide > said user with an optimized "large array multiplication" that he/she can > use? I'm sorry if I wasn't clear. I'm not providing a new array multiplication function. I'm taking the inner product function (which usually contains numpy array multiplication) from the user as a given. I am parallelizing the process of performing *many* inner products so that each core can do them independently. The parallelization is in performing many individual inner products, not within each inner product/array multiplication. > If so, then I'd post your implementation here, and folks can suggest > improvements. I did attach some code showing what I'm doing but that was a few days ago so I'll attach it again. > If it's regular old element-wise multiplication: > > a*b > > (where a and b are numpy arrays) > > then you are right, numpy isn't using any fancy multi-core aware > optimized ?package, so you should be able to make a faster version. > > You might try numexpr also -- it's pretty cool, though may not help for > a single operation. It might give you some ideas, though. > > http://www.scipy.org/SciPyPackages/NumExpr > > > -Chris NumExpr looks helpful and I'll definitely look into it, but the main issue is parallelizing many element-wise array multiplications, not speeding-up the array multiplication operation. It might be that parallelizing the individual inner products among cores isn't the right approach, but I'm not sure it's wrong yet. > > ? ? Message: 2 > > ? ? Date: Fri, 10 Jun 2011 09:23:10 -0400 > > ? ? From: Olivier Delalleau > > > ? ? Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) > > ? ? ? ? ? ?with numpy array multiplication > > ? ? To: Discussion of Numerical Python > ? ? > > > ? ? Message-ID: > ? ? > > > ? ? Content-Type: text/plain; charset="iso-8859-1" > > > > ? ? It may not work for you depending on your specific problem > > ? ? constraints, but > > ? ? if you could flatten the arrays, then it would be a dot, and you > > ? ? could maybe > > ? ? compute multiple such dot products by storing those flattened arrays > > ? ? into a > > ? ? matrix. > > > > ? ? -=- Olivier > > > > ? ? 2011/6/10 Brandt Belson > ? ? > > > > > ? ? ?> Hi, > > ? ? ?> Thanks for getting back to me. > > ? ? ?> I'm doing element wise multiplication, basically innerProduct = > > ? ? ?> numpy.sum(array1*array2) where array1 and array2 are, in general, > > ? ? ?> multidimensional. I need to do many of these operations, and I'd > > ? ? like to > > ? ? ?> split up the tasks between the different cores. I'm not using > > ? ? numpy.dot, if > > ? ? ?> I'm not mistaken I don't think that would do what I need. > > ? ? ?> Thanks again, > > ? ? ?> Brandt > > ? ? ?> > > ? ? ?> > > ? ? ?> Message: 1 > > ? ? ?>> Date: Thu, 09 Jun 2011 13:11:40 -0700 > > ? ? ?>> From: Christopher Barker > ? ? > > > ? ? ?>> Subject: Re: [Numpy-discussion] Using multiprocessing (shared > > ? ? memory) > > ? ? ?>> ? ? ? ?with numpy array multiplication > > ? ? ?>> To: Discussion of Numerical Python > ? ? > > > ? ? ?>> Message-ID: <4DF128FC.8000807 at noaa.gov > > ? ? > > > ? ? ?>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > ? ? ?>> > > ? ? ?>> Not much time, here, but since you got no replies earlier: > > ? ? ?>> > > ? ? ?>> > > ? ? ?>> > ? ? ?> I'm parallelizing some code I've written using the built in > > ? 
? ?>> > ? ? multiprocessing > > ? ? ?>> > ? ? ?> module. In my application, I need to multiply many > > ? ? large arrays > > ? ? ?>> > ? ? together > > ? ? ?>> > > ? ? ?>> is the matrix multiplication, or element-wise? If matrix, then numpy > > ? ? ?>> should be using LAPACK, which, depending on how its built, could be > > ? ? ?>> using all your cores already. This is heavily dependent on your your > > ? ? ?>> numpy (really the LAPACK it uses0 is built. > > ? ? ?>> > > ? ? ?>> > ? ? ?> and > > ? ? ?>> > ? ? ?> sum the resulting product arrays (inner products). > > ? ? ?>> > > ? ? ?>> are you using numpy.dot() for that? If so, then the above applies to > > ? ? ?>> that as well. > > ? ? ?>> > > ? ? ?>> I know I could look at your code to answer these questions, but I > > ? ? ?>> thought this might help. > > ? ? ?>> > > ? ? ?>> -Chris > > ? ? ?>> > > ? ? ?>> > > ? ? ?>> > > ? ? ?>> > > ? ? ?>> > > ? ? ?>> -- > > ? ? ?>> Christopher Barker, Ph.D. > > ? ? ?>> Oceanographer > > ? ? ?>> > > ? ? ?>> Emergency Response Division > > ? ? ?>> NOAA/NOS/OR&R ? ? ? ? ? ?(206) 526-6959 > > ? ? ? voice > > ? ? ?>> 7600 Sand Point Way NE ? (206) 526-6329 > > ? ? ? fax > > ? ? ?>> Seattle, WA ?98115 ? ? ? (206) 526-6317 > > ? ? ? main reception > > ? ? ?>> > > ? ? ?>> Chris.Barker at noaa.gov > Message: 2 > Date: Mon, 13 Jun 2011 12:51:08 -0500 > From: srean > Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) > ? ? ? ?with numpy array multiplication > To: Discussion of Numerical Python > Message-ID: > Content-Type: text/plain; charset=ISO-8859-1 > > Looking at the code the arrays that you are multiplying seem fairly > small (300, 200) and you have 50 of them. So it might the case that > there is not enough computational work to compensate for the cost of > forking new processes and communicating the results. Have you tried > larger arrays and more of them ? I've tried varying the sizes and the trends are consistent - using multiprocessing on numpy array multiplication is slower than not using it. For reference, I'm on a mac with the following numpy configuration: >>> print numpy.show_config() lapack_opt_info: extra_link_args = ['-Wl,-framework', '-Wl,Accelerate'] extra_compile_args = ['-faltivec'] define_macros = [('NO_ATLAS_INFO', 3)] blas_opt_info: extra_link_args = ['-Wl,-framework', '-Wl,Accelerate'] extra_compile_args = ['-faltivec', '-I/System/Library/Frameworks/vecLib.framework/Headers'] define_macros = [('NO_ATLAS_INFO', 3)] None > If you are on an intel machine and you have MKL libraries around I > would strongly recommend that you use the matrix multiplication > routine if possible. MKL will do the parallelization for you. Well, > any good BLAS implementation would do the same, you dont really need > MKL. ATLAS and ACML would work too, just that MKL has been setup for > us and it works well. > > To give an idea, given the amount of tuning and optimization that > these libraries have undergone a numpy.sum would be slower that an > multiplication with a vector of all ones. So in the interest of speed > the longer you stay in the BLAS context the better. > > --srean That seems like a good option. While I'd like the user to have minimal restrictions and dependencies to consider when writing the inner product function, maybe I should put the burden on them to parallelize the inner products, which could be simply done by configuring numpy with MKL I guess (I haven't tried this yet). 
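For anyone skimming the thread, the basic setup is many independent reductions like this (a minimal sketch using a plain multiprocessing.Pool rather than the shared-memory version in the attached script; the array sizes are the ones srean mentioned):

import numpy as np
from multiprocessing import Pool

def inner_product(pair):
    # element-wise multiply then sum, as discussed above
    a, b = pair
    return np.sum(a * b)

if __name__ == '__main__':
    # 50 pairs of (300, 200) arrays, one inner product per pair
    pairs = [(np.random.random((300, 200)), np.random.random((300, 200)))
             for _ in range(50)]
    pool = Pool()
    # each pair gets pickled and shipped to a worker process, which is
    # where much of the overhead comes from for arrays this small
    results = pool.map(inner_product, pairs)
    pool.close()
    pool.join()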
I'm still a bit curious what is causing my script to be slower when the multiple inner products are parallelized. Thanks, Brandt -------------- next part -------------- A non-text attachment was scrubbed... Name: myutil.py Type: application/octet-stream Size: 384 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: shared_mem.py Type: application/octet-stream Size: 1522 bytes Desc: not available URL:

From bsouthey at gmail.com Mon Jun 13 23:28:11 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 13 Jun 2011 22:28:11 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID:

On Mon, Jun 13, 2011 at 8:31 PM, Pauli Virtanen wrote: > On Mon, 13 Jun 2011 11:08:18 -0500, Bruce Southey wrote: > [clip] >> OSError: >> /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: cannot >> open shared object file: No such file or directory > > I think that's a result of Python 3.2 changing the extension module > file naming scheme (IIRC to a versioned one). > > The '.pyd' ending and its Unix counterpart are IIRC hardcoded to the > failing test, or some support code in Numpy. Clearly, they should instead > ask Python how it names the extensions modules. This information may be > available somewhere in the `sys` module. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion >

Pauli, I am impressed yet again! It really saved a lot of time to understand the reason. A quick comparison between Python versions:

Python2.7: multiarray.so
Python3.2: multiarray.cpython-32m.so

Searching for the extension leads to PEP 3149 http://www.python.org/dev/peps/pep-3149/ This is POSIX-only, which excludes Windows. I tracked it down to the list 'libname_ext' defined in file "numpy/ctypeslib.py" line 100:

libname_ext = ['%s.so' % libname, '%s.pyd' % libname]

On my 64-bit Linux system, I get the following results:

$ python2.4 -c "from distutils import sysconfig; print sysconfig.get_config_var('SO')"
.so
$ python2.5 -c "from distutils import sysconfig; print sysconfig.get_config_var('SO')"
.so
$ python2.7 -c "from distutils import sysconfig; print sysconfig.get_config_var('SO')"
.so
$ python3.1 -c "from distutils import sysconfig; print(sysconfig.get_config_var('SO'))"
.so
$ python3.2 -c "from distutils import sysconfig; print(sysconfig.get_config_var('SO'))"
.cpython-32m.so

My Windows 32-bit Python2.6 install:

>>> from distutils import sysconfig
>>> sysconfig.get_config_var('SO')
'.pyd'

Making the big assumption that other OSes, including 32-bit and 64-bit versions, also provide the correct suffix, this suggests adding the import and modification as:

from distutils import sysconfig
libname_ext = ['%s%s' % (libname, sysconfig.get_config_var('SO'))]

Actually, if that works for Mac, Sun etc. then the 'load_library' function could be smaller. Finally, test_basic2 in "tests/test_ctypeslib.py" needs to be changed as well because it is no longer correct. Just commenting out the 'fix' appears to work.
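To make that concrete, the lookup could look something like this (an illustrative sketch only, not the actual numpy.ctypeslib code; it assumes loader_path is a module's __file__):

import ctypes
import os
from distutils import sysconfig

def load_shared_library(libname, loader_path):
    # ask Python for the extension suffix: '.so', '.pyd' or e.g. '.cpython-32m.so'
    so_ext = sysconfig.get_config_var('SO')
    libdir = os.path.dirname(os.path.abspath(loader_path))
    for ext in (so_ext, '.so', '.pyd'):  # keep the old suffixes as fallbacks
        libpath = os.path.join(libdir, libname + ext)
        if os.path.exists(libpath):
            return ctypes.CDLL(libpath)
    raise OSError('%s not found in %s' % (libname, libdir))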
Bruce From scott.sinclair.za at gmail.com Tue Jun 14 03:32:46 2011 From: scott.sinclair.za at gmail.com (Scott Sinclair) Date: Tue, 14 Jun 2011 09:32:46 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> Message-ID: On 13 June 2011 17:11, Derek Homeier wrote: > you're right - I've tried to download the tarball, but am getting connection errors or incomplete > downloads from all available SF mirrors, and apparently I was still too thick to figure out how > to checkout a specific tag... I find the cleanest way is to checkout the tag in a new branch: $ git checkout -b release/v1.6.1rc1 v1.6.1rc1 Switched to a new branch 'release/v1.6.1rc1' Cheers, Scott From lpc at cmu.edu Tue Jun 14 07:06:02 2011 From: lpc at cmu.edu (Luis Pedro Coelho) Date: Tue, 14 Jun 2011 07:06:02 -0400 Subject: [Numpy-discussion] Adjacent matrix Message-ID: <201106140706.03005.lpc@cmu.edu> On Monday, June 13, 2011 03:55:46 PM Thiago Franco Moraes wrote: > "Find all of the grid points in that lie adjacent to one or more grid > points of opposite values." > > This is the code used to calculate that matrix: [spin] > Where nx, ny and nz are the x, y, z dimensions from the "im" input > binary matrix. I think this operation is called bwperim. My package, mahotas, supports it directly: mahotas.bwperim(bw, n) at http://packages.python.org/mahotas/ HTH, Luis -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL: From totonixsame at gmail.com Tue Jun 14 08:50:47 2011 From: totonixsame at gmail.com (Thiago Franco Moraes) Date: Tue, 14 Jun 2011 09:50:47 -0300 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: <201106140706.03005.lpc@cmu.edu> References: <201106140706.03005.lpc@cmu.edu> Message-ID: On Tue, Jun 14, 2011 at 8:06 AM, Luis Pedro Coelho wrote: > On Monday, June 13, 2011 03:55:46 PM Thiago Franco Moraes wrote: >> "Find all of the grid points in that lie adjacent to one or more grid >> points of opposite values." >> >> This is the code used to calculate that matrix: > [spin] >> Where nx, ny and nz are the x, y, z dimensions from the "im" input >> binary matrix. I think this operation is called bwperim. > > My package, mahotas, supports it directly: > > mahotas.bwperim(bw, n) > > at > http://packages.python.org/mahotas/ > > HTH, > Luis Thanks Luis. But mahotas.bwperim only supports 2D images, my images are 3D from Computerized Tomography (often very large) and I need connectivity 26. From totonixsame at gmail.com Tue Jun 14 09:29:38 2011 From: totonixsame at gmail.com (Thiago Franco Moraes) Date: Tue, 14 Jun 2011 10:29:38 -0300 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: <20110613200343.GH7860@phare.normalesup.org> References: <20110613200343.GH7860@phare.normalesup.org> Message-ID: On Mon, Jun 13, 2011 at 5:03 PM, Gael Varoquaux wrote: > Hi, > > You can probably find some inspiration from > https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/feature_extraction/image.py > > Ga?l Hi Ga?l, I don't know if I understand. The idea is to use _make_edges_3d to give me the connectivity, isn't it? 
Like for example, a 3x3 image: 0, 1, 2 3, 4, 5 6, 7, 8 If I use _make_edges_3d(3, 3) I get: array([[0, 1, 3, 4, 6, 7, 0, 1, 2, 3, 4, 5], [1, 2, 4, 5, 7, 8, 3, 4, 5, 6, 7, 8]]) So, for example, the pixel 4 is connected with 3, 5, 7 and 1 (connectivity 4). Is it? The problem is I need to work with connectivity 26 in 3D images, or 8 in 2D images. I'm not able to understand the code so well to apply to those connectivity I said (8 and 26). Do you think it is possible? Thanks Ga?l, > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From lpc at cmu.edu Tue Jun 14 10:09:31 2011 From: lpc at cmu.edu (Luis Pedro Coelho) Date: Tue, 14 Jun 2011 10:09:31 -0400 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: References: <201106140706.03005.lpc@cmu.edu> Message-ID: <201106141009.36271.lpc@cmu.edu> On Tuesday, June 14, 2011 08:50:47 AM Thiago Franco Moraes wrote: > On Tue, Jun 14, 2011 at 8:06 AM, Luis Pedro Coelho wrote: > > On Monday, June 13, 2011 03:55:46 PM Thiago Franco Moraes wrote: > >> "Find all of the grid points in that lie adjacent to one or more grid > >> points of opposite values." > > > >> This is the code used to calculate that matrix: > > [spin] > > > >> Where nx, ny and nz are the x, y, z dimensions from the "im" input > >> binary matrix. I think this operation is called bwperim. > > > > My package, mahotas, supports it directly: > > > > mahotas.bwperim(bw, n) > > > > at > > http://packages.python.org/mahotas/ > > > > HTH, > > Luis > > Thanks Luis. But mahotas.bwperim only supports 2D images, my images > are 3D from Computerized Tomography (often very large) and I need > connectivity 26. If the images are binary, an alternative is mahotas.labeled.borders() It returns the pixels which have more than one value in there neighbourhoods. HTH Luis -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL: From totonixsame at gmail.com Tue Jun 14 12:22:46 2011 From: totonixsame at gmail.com (Thiago Franco Moraes) Date: Tue, 14 Jun 2011 13:22:46 -0300 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: <201106141009.36271.lpc@cmu.edu> References: <201106140706.03005.lpc@cmu.edu> <201106141009.36271.lpc@cmu.edu> Message-ID: On Tue, Jun 14, 2011 at 11:09 AM, Luis Pedro Coelho wrote: > On Tuesday, June 14, 2011 08:50:47 AM Thiago Franco Moraes wrote: >> On Tue, Jun 14, 2011 at 8:06 AM, Luis Pedro Coelho wrote: >> > On Monday, June 13, 2011 03:55:46 PM Thiago Franco Moraes wrote: >> >> "Find all of the grid points in that lie adjacent to one or more grid >> >> points of opposite values." >> > >> >> This is the code used to calculate that matrix: >> > [spin] >> > >> >> Where nx, ny and nz are the x, y, z dimensions from the "im" input >> >> binary matrix. I think this operation is called bwperim. >> > >> > My package, mahotas, supports it directly: >> > >> > mahotas.bwperim(bw, n) >> > >> > at >> > http://packages.python.org/mahotas/ >> > >> > HTH, >> > Luis >> >> Thanks Luis. But mahotas.bwperim only supports 2D images, my images >> are 3D from Computerized Tomography (often very large) and I need >> connectivity 26. > > If the images are binary, an alternative is mahotas.labeled.borders() > > It returns the pixels which have more than one value in ?there neighbourhoods. > > HTH > Luis Thanks again Luis. 
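For the record, something along these lines with plain scipy.ndimage might also work for the 26-connectivity case (an untested sketch; note the border handling differs slightly from the C loops in my first message, which skip the outermost layer):

import numpy as np
from scipy import ndimage

# 'im' is the binary volume from the original message; a full 3x3x3
# structuring element gives 26-connectivity
structure = np.ones((3, 3, 3), dtype=bool)
dilated = ndimage.binary_dilation(im, structure)
eroded = ndimage.binary_erosion(im, structure)
# True wherever the 26-neighbourhood contains both values, i.e. the
# foreground and background points adjacent to the opposite value
out = dilated ^ eroded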
I have to add the version which works with 3D images is from git (https://github.com/luispedro/mahotas/), the one installed with pip only works with 2D images. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From mwwiebe at gmail.com Tue Jun 14 16:37:36 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 14 Jun 2011 15:37:36 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: Message-ID: On Mon, Jun 13, 2011 at 7:58 AM, Ralf Gommers wrote: > Hi, > > I am pleased to announce the availability of the first release candidate of > NumPy 1.6.1. This is a bugfix release, list of fixed bugs: > #1834 einsum fails for specific shapes > #1837 einsum throws nan or freezes python for specific array shapes > #1838 object <-> structured type arrays regression > #1851 regression for SWIG based code in 1.6.0 > #1863 Buggy results when operating on array copied with astype() > > If no problems are reported, the final release will be in one week. Sources > and binaries can be found at > https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ It would be nice to get a fix to this newly reported regression in: http://projects.scipy.org/numpy/ticket/1867 I'll look into it within a few days. -Mark > > > Enjoy, > Ralf > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 14 16:51:15 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 14 Jun 2011 15:51:15 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <8C3B05E3-FC83-4D29-B428-8F61BA6C94A7@gmail.com> References: <842A1EE7-027A-4BDB-996B-B1D8B6738D97@gmail.com> <20D9BDDD-1CF4-4B73-BDD5-2D77A7AF4D01@gmail.com> <8C3B05E3-FC83-4D29-B428-8F61BA6C94A7@gmail.com> Message-ID: On Fri, Jun 10, 2011 at 7:04 PM, Pierre GM wrote: > > On Jun 11, 2011, at 1:03 AM, Mark Wiebe wrote: > > > > I don't think you would want to extend the datetime with more metadata, > but rather use it as a tool to create the timeseries with. You could create > a lightweight wrapper around datetime arrays which exposed a > timeseries-oriented interface but used the datetime functionality to do the > computations and provide good performance with large numbers of datetimes. I > would love it if you have the time to experiment a bit with the > https://github.com/m-paradox/numpy branch I'm developing on, so I could > tweak the design or add features to make this easier based on your feedback > from using the new data type. > > I'll do my best, depending on my free time, alas... Did you have a chance > to check scikits.timeseries (and its experimental branch) ? If you have any > question about it, feel free to contact me offlist (less noise that way, we > can always summarize the discussion). > I've taken a look through the documentation and some of the code, but am finding it a bit difficult to see the motivation of all the pieces. Probably the examples Dave Hirschfeld put in his emails are the most illuminating thing I've seen so far, as well as some birds-eye view explanations of what time series are within statistics. 
It seems to me that what the time series analysis needs is operations for functions sampled at particular points in time or aggregated over particular intervals of time. With the datetime dtype in place, it should be possible to have one code which does that for any of floats, integers, or datetimes as the domain, and I think removing the multiplier in dtypes like 'M8[10s]' and having that be represented as time intervals might be more flexible and still do everything the current scikits.timeseries can do. -Mark > Cheers > P. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Jun 14 17:30:00 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 14 Jun 2011 23:30:00 +0200 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: References: <20110613200343.GH7860@phare.normalesup.org> Message-ID: <20110614213000.GG10377@phare.normalesup.org> On Tue, Jun 14, 2011 at 10:29:38AM -0300, Thiago Franco Moraes wrote: > I don't know if I understand. The idea is to use _make_edges_3d to > give me the connectivity, isn't it? Like for example, a 3x3 image: > 0, 1, 2 > 3, 4, 5 > 6, 7, 8 > If I use _make_edges_3d(3, 3) I get: > array([[0, 1, 3, 4, 6, 7, 0, 1, 2, 3, 4, 5], > [1, 2, 4, 5, 7, 8, 3, 4, 5, 6, 7, 8]]) > So, for example, the pixel 4 is connected with 3, 5, 7 and 1 > (connectivity 4). Is it? Yes, you go the idea. > The problem is I need to work with connectivity 26 in 3D images, or 8 > in 2D images. I'm not able to understand the code so well to apply to > those connectivity I said (8 and 26). The code is for 4 connectivity (or 6 in 3D). You will need to modify the _make_edges_3d function to get the connectivity you want. Hopefully this will get the ball running, and by looking at the code you will figure out how to modify it. HTH, Ga?l From pav at iki.fi Tue Jun 14 18:30:00 2011 From: pav at iki.fi (Pauli Virtanen) Date: Tue, 14 Jun 2011 22:30:00 +0000 (UTC) Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 References: Message-ID: On Tue, 14 Jun 2011 15:37:36 -0500, Mark Wiebe wrote: [clip] > It would be nice to get a fix to this newly reported regression in: > > http://projects.scipy.org/numpy/ticket/1867 I see that regression only in master, not in 1.6.x, so I don't think it will delay 1.6.1. Pauli From mwwiebe at gmail.com Tue Jun 14 19:19:45 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 14 Jun 2011 18:19:45 -0500 Subject: [Numpy-discussion] fixing up datetime In-Reply-To: <20110613115930.03749905@ajackson.org> References: <20110613115930.03749905@ajackson.org> Message-ID: On Mon, Jun 13, 2011 at 11:59 AM, wrote: > I'm joining this late (I've been traveling), but it might be useful to > look at the fairly new R module "lubridate". They have put quite some > thought into simplifying date handling, and when I have used it I have > generally been quite pleased. The documentation is quite readable. > > Just Google it and it will be at the top. > Cool, thanks for the link! I've read through the "made easy" pdf, here are some points about it: - It appears to gloss over how they resolve the ambiguities when adding months and years, or days in local time zones across daylight savings boundaries. 
The examples they give are when it's not ambiguous, like adding a year to January 1st, or adding a second when the daylight savings time time springs forward. They do mention returning NA in that spring forward gap, but don't mention what they do in the fall back hour when the same clock time represents two different moments an hour apart. - They mention the case timedelta / timedelta, which hadn't been covered in the NEP, and produces a unitless scalar, that's something to add to NumPy. - For NumPy, I'm planning to just support UTC time, with some functions to provide local timezone manipulations which use the OS's timezone setting. Lubridate appears to have a timezone attached to the date. - Lubridate distinguishes between "durations" and "periods", where durations are in fixed time units and periods are relative to the date, like months and years. The NumPy approach I've taken is for conversions between these types of units to require unsafe casting and/or fail during type promotion. - They use accessor functions for the year, month, day, etc. components of the datetime. I'm thinking this could be done in NumPy by extending the structured dtype idea with "derived fields", which look like structured dtype fields, but are computed from the value instead of stored directly. This idea isn't fully worked out yet, however. -Mark > >Hey all, > > > >So I'm doing a summer internship at Enthought, and the first thing they > >asked me to look into is finishing the datetime type in numpy. It turns > out > >that the estimates of how complete the type was weren't accurate, and to > >support what the NEP describes required generalizing the ufunc type > >resolution system. I also found that the date/time parsing code (based on > >mxDateTime) was not robust, producing something for almost any arbitrary > >garbage input. I've replaced much of the broken code and implemented a lot > >of the functionality, and thought this might be a good point to do a pull > >request on what I've got and get feedback on the issues I've run into. > > > >* The existing datetime-related API is probably not useful, and in fact > >those functions aren't used internally anymore. Is it reasonable to remove > >the functions, or do we just deprecate them? > > > >* Leap seconds probably deserve a rigorous treatment, but having an > internal > >representation with leap-seconds overcomplicates otherwise very simple and > >fast operations. Could we internally use a value matching TAI or GPS time? > >Currently it's a UTC time in the present, but the 1970 epoch is then not > the > >UTC 1970 epoch, but 10s of seconds off, and this isn't properly specified. > >What are people's opinions? The Python datetime.datetime doesn't support > >leap seconds (seconds == 60 is disallowed). > > > >* Default conversion to string - should it be in UTC or with the local > >timezone baked in? As UTC it may be confusing because 'now' will print as > a > >different time than people would expect. > > > >* Business days - The existing business idea doesn't seem very useful, > >representing just the western M-F work week and not accounting for > holidays. > >I've come up with a design which might address these issues: Extend the > >metadata for business days with a string identifier, like 'M8[B:USA]', > then > >have a global internal dictionary which maps 'USA' to a workweek mask and > a > >list of holidays. 
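To make the quoted registry idea concrete, a minimal sketch of what such a mapping might hold; the registry, its entries and the helper function below are all hypothetical, not an existing NumPy API:

import numpy as np

# Hypothetical registry mapping an identifier like the 'USA' in 'M8[B:USA]'
# to a Mon-Sun workweek mask and an array of holidays with day units.
business_day_registry = {
    'USA': {
        'weekmask': np.array([1, 1, 1, 1, 1, 0, 0], dtype=bool),
        'holidays': np.array(['2011-07-04', '2011-11-24', '2011-12-26'],
                             dtype='M8[D]'),
    },
}

def is_business_day(date, region='USA'):
    # Sketch of the lookup a dtype like 'M8[B:USA]' would imply.
    d = np.datetime64(date, 'D')
    defn = business_day_registry[region]
    weekday = (d.astype('int64') + 3) % 7    # 1970-01-01 was a Thursday
    return bool(defn['weekmask'][weekday]) and d not in defn['holidays']

print(is_business_day('2011-07-04'))    # False: a listed holiday
print(is_business_day('2011-07-05'))    # True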
The call to prepare this dictionary for a particular > >business day type might look like np.set_business_days('USA', [1, 1, 1, 1, > >1, 0, 0], np.array([ list of US holidays ], dtype='M8[D]')). Internally, > >business days would be stored the same as regular days, but with special > >treatment where landing on a weekend or holiday gives you back a NaT. > > > >If you are interested in the business day functionality, please comment on > >this design! > > > >* The dtype constructor accepted 'O#' for object types, something I think > >was wrong. I've removed that, but allow # to be 4 or 8, producing a > >deprecation warning if it occurs. > > > >* Would it make sense to offset the week-based datetime's epoch so it > aligns > >with ISO 8601's week format? Jan 1, 1970 is a thursday, but the YYYY-Www > >date format uses weeks starting on monday. I think producing strings in > this > >format when the datetime has units of weeks would be a natural thing to > do. > > > >* Should the NaT (not-a-time) value behave like floating-point NaN? i.e. > NaT > >== NaT return false, etc. Should operations generating NaT trigger an > >'invalid' floating point exception in ufuncs? > > > >Cheers, > >Mark > > > -- > ----------------------------------------------------------------------- > | Alan K. Jackson | To see a World in a Grain of Sand | > | alan at ajackson.org | And a Heaven in a Wild Flower, | > | www.ajackson.org | Hold Infinity in the palm of your hand | > | Houston, Texas | And Eternity in an hour. - Blake | > ----------------------------------------------------------------------- > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 14 19:34:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 14 Jun 2011 18:34:44 -0500 Subject: [Numpy-discussion] code review/build & test for datetime business day API Message-ID: These functions are now fully implemented and documented. As always, code reviews are welcome here: https://github.com/numpy/numpy/pull/87 and for those that don't want to dig into review C code, the commit for the documentation is here: https://github.com/m-paradox/numpy/commit/6b5a42a777b16812e774193b06da1b68b92bc689 This is probably also another good place to do a merge to master, so if people could test it on Mac/Windows/other platforms that would be much appreciated. Thanks, Mark On Fri, Jun 10, 2011 at 5:49 PM, Mark Wiebe wrote: > I've implemented the busday_offset function with support for the weekmask > and roll parameters, the commits are tagged 'datetime-bday' in the pull > request here: > > https://github.com/numpy/numpy/pull/87 > > -Mark > > > On Thu, Jun 9, 2011 at 5:23 PM, Mark Wiebe wrote: > >> Here's a possible design for a business day API for numpy datetimes: >> >> >> The 'B' business day unit will be removed. All business day-related >> calculations will be done using the 'D' day unit. >> >> A class *BusinessDayDef* to encapsulate the definition of the business >> week and holidays. The business day functions will either take one of these >> objects, or separate weekmask and holidays parameters, to specify the >> business day definition. 
This class serves as both a performance >> optimization and a way to encapsulate the weekmask and holidays together, >> for example if you want to make a dictionary mapping exchange names to their >> trading days definition. >> >> The weekmask can be specified in a number of ways, and internally becomes >> a boolean array with 7 elements with True for the days Monday through Sunday >> which are valid business days. Some different notations are for the 5-day >> week include [1,1,1,1,1,0,0], "1111100" "MonTueWedThuFri". The holidays are >> always specified as a one-dimensional array of dtype 'M8[D]', and are >> internally used in sorted form. >> >> >> A function *is_busday*(datearray, weekmask=, holidays=, busdaydef=) >> returns a boolean array matching the input datearray, with True for the >> valid business days. >> >> A function *busday_offset*(datearray, offsetarray, >> roll='raise', weekmask=, holidays=, busdaydef=) which first applies the >> 'roll' policy to start at a valid business date, then offsets the date by >> the number of business days specified in offsetarray. The arrays datearray >> and offsetarray are broadcast together. The 'roll' parameter can be >> 'forward'/'following', 'backward'/'preceding', 'modifiedfollowing', >> 'modifiedpreceding', or 'raise' (the default). >> >> A function *busday_count*(datearray1, datearray2, weekmask=, holidays=, >> busdaydef=) which calculates the number of business days between datearray1 >> and datearray2, not including the day of datearray2. >> >> >> For example, to find the first Monday in Feb 2011, >> >> >>>np.busday_offset('2011-02', 0, roll='forward', weekmask='Mon') >> >> or to find the number of weekdays in Feb 2011, >> >> >>>np.busday_count('2011-02', '2011-03') >> >> This set of three functions appears to be powerful enough to express the >> business-day computations that I've been shown thus far. >> >> Cheers, >> Mark >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From totonixsame at gmail.com Wed Jun 15 08:31:40 2011 From: totonixsame at gmail.com (Thiago Franco Moraes) Date: Wed, 15 Jun 2011 09:31:40 -0300 Subject: [Numpy-discussion] Adjacent matrix In-Reply-To: <20110614213000.GG10377@phare.normalesup.org> References: <20110613200343.GH7860@phare.normalesup.org> <20110614213000.GG10377@phare.normalesup.org> Message-ID: On Tue, Jun 14, 2011 at 6:30 PM, Gael Varoquaux wrote: > On Tue, Jun 14, 2011 at 10:29:38AM -0300, Thiago Franco Moraes wrote: >> I don't know if I understand. The idea is to use _make_edges_3d to >> give me the connectivity, isn't it? Like for example, a 3x3 image: > >> 0, 1, 2 >> 3, 4, 5 >> 6, 7, 8 > >> If I use _make_edges_3d(3, 3) I get: > >> array([[0, 1, 3, 4, 6, 7, 0, 1, 2, 3, 4, 5], >> ? ? ? ?[1, 2, 4, 5, 7, 8, 3, 4, 5, 6, 7, 8]]) > >> So, for example, the pixel 4 is connected with 3, 5, 7 and 1 >> (connectivity 4). Is it? > > Yes, you go the idea. > >> The problem is I need to work with connectivity 26 in 3D images, or 8 >> in 2D images. I'm not able to understand the code so well to apply to >> those connectivity I said (8 and 26). > > The code is for 4 connectivity (or 6 in 3D). You will need to modify the > _make_edges_3d function to get the connectivity you want. Hopefully this > will get the ball running, and by looking at the code you will figure out > how to modify it. > > HTH, > > Ga?l Thanks Ga?l. To resolve my problem I'm using mahotas.labeled.borders which was proposed by Luis. 
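For readers following the connectivity sub-thread above, here is a hedged sketch of how an edge list with 8-connectivity could be built for a 2D image in plain NumPy; it is only an illustration of the approach, not the actual _make_edges_3d implementation:

import numpy as np

def make_edges_2d_conn8(n_x, n_y):
    # Edge list of shape (2, n_edges) linking each pixel of an n_x-by-n_y
    # image to its right, bottom, bottom-right and bottom-left neighbours,
    # which gives 8-connectivity once both directions are considered.
    idx = np.arange(n_x * n_y).reshape(n_x, n_y)
    right = np.vstack([idx[:, :-1].ravel(), idx[:, 1:].ravel()])
    down = np.vstack([idx[:-1, :].ravel(), idx[1:, :].ravel()])
    down_right = np.vstack([idx[:-1, :-1].ravel(), idx[1:, 1:].ravel()])
    down_left = np.vstack([idx[:-1, 1:].ravel(), idx[1:, :-1].ravel()])
    return np.hstack([right, down, down_right, down_left])

print(make_edges_2d_conn8(3, 3))

For the 3x3 example discussed earlier this connects pixel 4 to 1, 3, 5, 7 and also to 0, 2, 6, 8; the 26-connectivity 3D case adds the analogous diagonal offsets along the third axis.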
However, the ideas from _make_edges_3d can be useful to me in other problems I'm having. Thanks again. From jonathan.taylor at utoronto.ca Wed Jun 15 11:12:52 2011 From: jonathan.taylor at utoronto.ca (Jonathan Taylor) Date: Wed, 15 Jun 2011 11:12:52 -0400 Subject: [Numpy-discussion] Behaviour of ndarray and other objects with __radd__? Message-ID: Hi, I would like to have objects that I can mix with ndarrays in arithmetic expressions but I need my object to have control of the operation even when it is on the right hand side of the equation. I realize from the documentation that the way to do this is to actually subclass ndarray but this is undesirable because I do not need all the heavy machinery of a ndarray and I do not want users to see all of the ndarray methods. Is there a way to somehow achieve these goals? I would also very much appreciate some clarification of what is happening in the following basic example: import numpy as np class Foo(object): # THE NEXT LINE IS COMMENTED # __array_priority__ = 0 def __add__(self, other): print 'Foo has control over', other return 1 def __radd__(self, other): print 'Foo has control over', other return 1 x = np.arange(3) f = Foo() print f + x print x + f yields Foo has control over [0 1 2] 1 Foo has control over 0 Foo has control over 1 Foo has control over 2 [1 1 1] I see that I have control from the left side as expected and I suspect that what is happening in the second case is that numpy is trying to "broadcast" my object onto the left side as if it was an object array? Now if I uncomment the line __array_priority__ = 0 I do seem to accomplish my goals (see below) but I am not sure why. I am surprised, given what I have read in the documentation, that __array_priority__ does anything in a non subclass of ndarray. Furthermore, I am even more surprised that it does anything when it is 0, which is the same as ndarray.__array_priority__ from what I understand. Any clarification of this would be greatly appreciated. Output with __array_priority__ uncommented: jtaylor at yukon:~$ python foo.py Foo has control over [0 1 2] 1 Foo has control over [0 1 2] 1 Thanks, Jonathan. From shish at keba.be Wed Jun 15 11:34:22 2011 From: shish at keba.be (Olivier Delalleau) Date: Wed, 15 Jun 2011 11:34:22 -0400 Subject: [Numpy-discussion] Behaviour of ndarray and other objects with __radd__? In-Reply-To: References: Message-ID: I don't really understand this behavior either, but juste note that according to http://docs.scipy.org/doc/numpy/user/c-info.beyond-basics.html "This attribute can also be defined by objects that are not sub-types of the ndarray" -=- Olivier 2011/6/15 Jonathan Taylor > Hi, > > I would like to have objects that I can mix with ndarrays in > arithmetic expressions but I need my object to have control of the > operation even when it is on the right hand side of the equation. I > realize from the documentation that the way to do this is to actually > subclass ndarray but this is undesirable because I do not need all the > heavy machinery of a ndarray and I do not want users to see all of the > ndarray methods. Is there a way to somehow achieve these goals? 
> > I would also very much appreciate some clarification of what is > happening in the following basic example: > > import numpy as np > class Foo(object): > # THE NEXT LINE IS COMMENTED > # __array_priority__ = 0 > def __add__(self, other): > print 'Foo has control over', other > return 1 > def __radd__(self, other): > print 'Foo has control over', other > return 1 > > x = np.arange(3) > f = Foo() > > print f + x > print x + f > > yields > > Foo has control over [0 1 2] > 1 > Foo has control over 0 > Foo has control over 1 > Foo has control over 2 > [1 1 1] > > I see that I have control from the left side as expected and I suspect > that what is happening in the second case is that numpy is trying to > "broadcast" my object onto the left side as if it was an object array? > > Now if I uncomment the line __array_priority__ = 0 I do seem to > accomplish my goals (see below) but I am not sure why. I am > surprised, given what I have read in the documentation, that > __array_priority__ does anything in a non subclass of ndarray. > Furthermore, I am even more surprised that it does anything when it is > 0, which is the same as ndarray.__array_priority__ from what I > understand. Any clarification of this would be greatly appreciated. > > Output with __array_priority__ uncommented: > > jtaylor at yukon:~$ python foo.py > Foo has control over [0 1 2] > 1 > Foo has control over [0 1 2] > 1 > > Thanks, > Jonathan. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 15 11:46:50 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 15 Jun 2011 09:46:50 -0600 Subject: [Numpy-discussion] Behaviour of ndarray and other objects with __radd__? In-Reply-To: References: Message-ID: On Wed, Jun 15, 2011 at 9:34 AM, Olivier Delalleau wrote: > I don't really understand this behavior either, but juste note that > according to > http://docs.scipy.org/doc/numpy/user/c-info.beyond-basics.html > "This attribute can also be defined by objects that are not sub-types of > the ndarray" > > -=- Olivier > > > 2011/6/15 Jonathan Taylor > >> Hi, >> >> I would like to have objects that I can mix with ndarrays in >> arithmetic expressions but I need my object to have control of the >> operation even when it is on the right hand side of the equation. I >> realize from the documentation that the way to do this is to actually >> subclass ndarray but this is undesirable because I do not need all the >> heavy machinery of a ndarray and I do not want users to see all of the >> ndarray methods. Is there a way to somehow achieve these goals? 
>> >> I would also very much appreciate some clarification of what is >> happening in the following basic example: >> >> import numpy as np >> class Foo(object): >> # THE NEXT LINE IS COMMENTED >> # __array_priority__ = 0 >> def __add__(self, other): >> print 'Foo has control over', other >> return 1 >> def __radd__(self, other): >> print 'Foo has control over', other >> return 1 >> >> x = np.arange(3) >> f = Foo() >> >> print f + x >> print x + f >> >> yields >> >> Foo has control over [0 1 2] >> 1 >> Foo has control over 0 >> Foo has control over 1 >> Foo has control over 2 >> [1 1 1] >> >> I see that I have control from the left side as expected and I suspect >> that what is happening in the second case is that numpy is trying to >> "broadcast" my object onto the left side as if it was an object array? >> >> Now if I uncomment the line __array_priority__ = 0 I do seem to >> accomplish my goals (see below) but I am not sure why. I am >> surprised, given what I have read in the documentation, that >> __array_priority__ does anything in a non subclass of ndarray. >> Furthermore, I am even more surprised that it does anything when it is >> 0, which is the same as ndarray.__array_priority__ from what I >> understand. Any clarification of this would be greatly appreciated. >> >> There's a bit of code in the ufunc implementation that checks for the __array_priority__ attribute regardless of if the object subclasses ndarray, probably because someone once needed to solve the same problem you are having. The comment that goes with it is /* * FAIL with NotImplemented if the other object has * the __r__ method and has __array_priority__ as * an attribute (signalling it can handle ndarray's) * and is not already an ndarray or a subtype of the same type. */ Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From chaoyuejoy at gmail.com Wed Jun 15 12:10:52 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Wed, 15 Jun 2011 18:10:52 +0200 Subject: [Numpy-discussion] what python module to handle csv? Message-ID: Dear all pythoners, what do you use python module to handle csv file (for reading we can use numpy.genfromtxt)? is there anyone we can do with csv file very convinient as that in R? can numpy.genfromtxt be used as writing? (I didn't try this yet because on our server we have only numpy 1.0.1...). This really make me struggling since csv is a very important interface file (I think so...). Thanks a lot, Sincerely, Chao -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Wed Jun 15 12:26:25 2011 From: shish at keba.be (Olivier Delalleau) Date: Wed, 15 Jun 2011 12:26:25 -0400 Subject: [Numpy-discussion] what python module to handle csv? In-Reply-To: References: Message-ID: Using savetxt with delimiter=',' should do the trick. If you want a more advanced csv interface to e.g. save more than a numpy array into a single csv, you can probably look into the python csv module. -=- Olivier 2011/6/15 Chao YUE > Dear all pythoners, > > what do you use python module to handle csv file (for reading we can use > numpy.genfromtxt)? 
is there anyone we can do with csv file very convinient > as that in R? > can numpy.genfromtxt be used as writing? (I didn't try this yet because on > our server we have only numpy 1.0.1...). This really make me struggling > since csv is a very > important interface file (I think so...). > > Thanks a lot, > > Sincerely, > > Chao > > -- > > *********************************************************************************** > Chao YUE > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > UMR 1572 CEA-CNRS-UVSQ > Batiment 712 - Pe 119 > 91191 GIF Sur YVETTE Cedex > Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 > > ************************************************************************************ > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla at molden.no Wed Jun 15 13:15:09 2011 From: sturla at molden.no (Sturla Molden) Date: Wed, 15 Jun 2011 19:15:09 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: <4DF8E89D.7080904@molden.no> Den 13.06.2011 19:51, skrev srean: > If you are on an intel machine and you have MKL libraries around I > would strongly recommend that you use the matrix multiplication > routine if possible. MKL will do the parallelization for you. Well, > any good BLAS implementation would do the same, you dont really need > MKL. ATLAS and ACML would work too, just that MKL has been setup for > us and it works well. Never mind ATLAS. Alternatives to MKL are GotoBLAS2, ACML and ACML-GPU. GotoBLAS2 is generally faster than MKL. The relative performance of ACML and MKL depends on the architecture, but both are now fast on either architecture. ACML-GPU will move matrix multiplication (*GEMM subroutines) to the (AMD/ATI) GPU if it can (and the problem is large enough). MKL used to run in tortoise mode on AMD chips, but not any longer due to intervention by the Federal Trade Commission. IMHO, trying to beat Intel or AMD performance library developers with Python, NumPy and multiprocessing is just silly. Nothing we do with array operator * and np.sum is ever going to compare with BLAS functions from these libraries. Sometimes we need a little bit more course-grained parallelism. Then it's time to think about Python threads and releasing the GIL or use OpenMP with C or Fortran. multiprocessing is the last tool to think about. It is mostly approproate for 'embarassingly parallel' paradigms, and certainly not the tool for parallel matrix multiplication. Sturla From sturla at molden.no Wed Jun 15 13:38:45 2011 From: sturla at molden.no (Sturla Molden) Date: Wed, 15 Jun 2011 19:38:45 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DF8E89D.7080904@molden.no> References: <4DF8E89D.7080904@molden.no> Message-ID: <4DF8EE25.6060600@molden.no> Perhaps it is time to write somthing in the SciPy cookbook about parallel computing with NumPy? It seems to be certain problems that are discussed again and again. These are some issues that come to mind (I'm sure there is more): - The difference between I/O bound, memory bound, and CPU bound work. - Why NumPy code is usually memory bound, and what that means. 
- The problem with false-sharing in cache lines (including Python refcounts) - What the GIL is and what it's not (real information instead of FUD) - Linear algebra with optimized BLAS and LAPACK libraries. - Parallel FFTs (FFTW, MKL, ACML) - Parallel PRNGs (and algorithmic pitfalls) - Autovectorizing Fortran compilers - OpenMP with C, C++ or Fortran (and using it from Python) - Python threads and releasing the GIL - Python threads in Cython - native threads in Cython - multiprocessing with ordinary NumPy arrays - multiprocessing with shared memory - MPI with Python (mpi4py) - os.fork and copy-on-write memory (including the problem with Python refcounts) - Using GPUs with Python, including ACML-GPU, PyOpenCL and PyCUDA. Sturla From gokhansever at gmail.com Wed Jun 15 15:30:56 2011 From: gokhansever at gmail.com (=?UTF-8?Q?G=C3=B6khan_Sever?=) Date: Wed, 15 Jun 2011 13:30:56 -0600 Subject: [Numpy-discussion] Unicode characters in a numpy array Message-ID: Hello, The following snippet works fine for a regular string and prints out the string without a problem. python Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) [GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> mystr = u"??????" >>> mystr u'\xf6\xf6\xf6\u011f\u011f\u011f' >>> type(mystr) >>> print mystr ?????? What is the correct way to print out the following array? >>> import numpy as np >>> arr = np.array(u"??????") >>> arr array(u'\xf6\xf6\xf6\u011f\u011f\u011f', dtype='>> print arr Traceback (most recent call last): File "", line 1, in File "/usr/lib64/python2.7/site-packages/numpy/core/numeric.py", line 1379, in array_str return array2string(a, max_line_width, precision, suppress_small, ' ', "", str) File "/usr/lib64/python2.7/site-packages/numpy/core/arrayprint.py", line 426, in array2string lst = style(x) UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128) Thanks. -- G?khan From mwwiebe at gmail.com Wed Jun 15 16:40:51 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 15 Jun 2011 15:40:51 -0500 Subject: [Numpy-discussion] datetimes with date vs time units, local time, and time zones Message-ID: Towards a reasonable behavior with regard to local times, I've made the default repr for datetimes use the C standard library to print them in a local ISO format. Combined with the ISO8601-prescribed behavior of interpreting datetime strings with no timezone specifier to be in local times, this allows the following cases to behave reasonably: >>> np.datetime64('now') numpy.datetime64('2011-06-15T15:16:51-0500','s') >>> np.datetime64('2011-06-15T18:00') numpy.datetime64('2011-06-15T18:00-0500','m') As noted in another thread, there can be some extremely surprising behavior as a consequence: >>> np.array(['now', '2011-06-15'], dtype='M') array(['2011-06-15T15:18:26-0500', '2011-06-14T19:00:00-0500'], dtype='datetime64[s]') Having the 15th of June print out as 7pm on the 14th of June is probably not what one would generally expect, so I've come up with an approach which hopefully deals with this in a good way. One firm principal of the datetime in NumPy is that it is always stored as a POSIX time (referencing UTC), or a TAI time. There are two categories of units that can be used, which I will call *date units* and *time units*. The date units are 'Y', 'M', 'W', and 'D', while the time units are 'h', 'm', 's', ..., 'as'. 
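A small, hedged illustration of the two unit categories just listed, using datetime64 roughly as it behaves in a released NumPy (the local-time printing and the date/time casting barrier described in this thread were proposals at the time):

import numpy as np

# Date units: the unit is inferred from the precision of the string.
d = np.datetime64('2011-06-15')             # datetime64[D]
m = np.datetime64('2011-06')                # datetime64[M]

# Time units: anything carrying a time-of-day component.
t = np.datetime64('2011-06-15T18:00')       # datetime64[m]
s = np.datetime64('2011-06-15T18:00:00')    # datetime64[s]

print(d.dtype, t.dtype)

# Crossing the date/time boundary is an explicit cast; under the proposal
# above this direction would require casting='unsafe' for datetimes.
print(d.astype('datetime64[s]'))            # 2011-06-15T00:00:00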
Time zones are only applied to datetimes stored in time units, so there's a qualitative difference between date and time units with respect to string conversions and calendar operations. I would like to place an 'unsafe' casting barrier between the date units and the time units, so that the above conversion from a date into a datetime will raise an error instead of producing a confusing result. This only applies to datetimes and not timedeltas, because for timedeltas the day <-> hour case is fine, it is just the year/month <-> other units which has issues, and that is already treated with an 'unsafe' casting barrier. Two new functions will facilitate the conversions between datetimes with date units and time units: date_as_datetime(datearray, hour, minute, second, microsecond, timezone='local', unit=None, out=None), which converts the provided dates into datetimes at the specified time, according to the specified timezone. If 'unit' is specified, it controls the output unit, otherwise it is the units in 'out' or the amount of precision specified in the function. datetime_as_date(datetimearray, timezone='local', out=None), which converts the provided datetimes into dates according to the specified timezone. In both functions, timezone can be any of 'UTC', 'TAI', 'local', '+/-####', or a datetime.tzinfo object. The latter will allow NumPy datetimes to work with the pytz library for flexible time zone support. I would also like to extend the 'today' input string parsing to accept strings like 'today 12:30' to allow a convenient way to express different local times occurring today, mostly useful for interactive usage. I welcome any comments on this design, particularly if you can find a case where this doesn't produce a reasonable behavior. Cheers, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Wed Jun 15 17:22:37 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Wed, 15 Jun 2011 14:22:37 -0700 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DF8E89D.7080904@molden.no> References: <4DF8E89D.7080904@molden.no> Message-ID: <4DF9229D.7080002@noaa.gov> Sturla Molden wrote: > IMHO, trying to beat Intel or AMD performance library developers with > Python, NumPy and multiprocessing is just silly. Nothing we do with > array operator * and np.sum is ever going to compare with BLAS functions > from these libraries. I think the issue got confused -- the OP was not looking to speed up a matrix multiply, but rather to speed up a whole bunch of independent matrix multiplies. > Sometimes we need a little bit more course-grained parallelism. Then > it's time to think about Python threads and releasing the GIL or use > OpenMP with C or Fortran. > > multiprocessing is the last tool to think about. It is mostly > approproate for 'embarassingly parallel' paradigms, and certainly not > the tool for parallel matrix multiplication. I think this was, in fact, an embarrassingly parallel example. But when the OP put it on two processors, it was slower than one -- hence his question. I got the same result on my machine as well. I'm not sure he tried python threads, that may be worth a shot. It would also would be great if someone that actually understands this stuff could look at his code and explain why the slowdown occurs (hint, hint!) -CHB -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Wed Jun 15 18:02:18 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 15 Jun 2011 17:02:18 -0500 Subject: [Numpy-discussion] tighten up ufunc casting rule In-Reply-To: References: <20110602200945.GG2554@phare.normalesup.org> Message-ID: On Tue, Jun 7, 2011 at 2:08 PM, Ralf Gommers wrote: > On Tue, Jun 7, 2011 at 8:59 PM, Mark Wiebe wrote: > >> On Tue, Jun 7, 2011 at 1:41 PM, Ralf Gommers > > wrote: >> >>> On Mon, Jun 6, 2011 at 6:56 PM, Mark Wiebe wrote: >>> >>>> On Mon, Jun 6, 2011 at 10:30 AM, Mark Wiebe wrote: >>>> >>>>> On Sun, Jun 5, 2011 at 3:43 PM, Ralf Gommers < >>>>> ralf.gommers at googlemail.com> wrote: >>>>> >>>>>> On Thu, Jun 2, 2011 at 10:12 PM, Mark Wiebe wrote: >>>>>> >>>>>>> On Thu, Jun 2, 2011 at 3:09 PM, Gael Varoquaux < >>>>>>> gael.varoquaux at normalesup.org> wrote: >>>>>>> >>>>>>>> On Thu, Jun 02, 2011 at 03:06:58PM -0500, Mark Wiebe wrote: >>>>>>>> > Would anyone object to, at least temporarily, tightening up the >>>>>>>> default >>>>>>>> > ufunc casting rule to 'same_kind' in NumPy master? It's a one >>>>>>>> line change, >>>>>>>> > so would be easy to undo, but such a change is very desirable >>>>>>>> in my >>>>>>>> > opinion. >>>>>>>> > This would raise an exception, since it's np.add(a, 1.9, >>>>>>>> out=a), >>>>>>>> > converting a float to an int: >>>>>>>> >>>>>>>> > >>> a = np.arange(3, dtype=np.int32) >>>>>>>> >>>>>>>> > >>> a += 1.9 >>>>>>>> >>>>>>>> That's probably going to break a huge amount of code which relies on >>>>>>>> the >>>>>>>> current behavior. >>>>>>>> >>>>>>>> Am I right in believing that this should only be considered for a >>>>>>>> major >>>>>>>> release of numpy, say numpy 2.0? >>>>>>> >>>>>>> >>>>>>> Absolutely, and that's why I'm proposing to do it in master now, >>>>>>> fairly early in a development cycle, so we can evaluate its effects. If the >>>>>>> next version is 1.7, we probably would roll it back for release (a 1 line >>>>>>> change), and if the next version is 2.0, we probably would keep it in. >>>>>>> >>>>>>> I suspect at least some of the code relying on the current behavior >>>>>>> may have bugs, and tightening this up is a way to reveal them. >>>>>>> >>>>>>> >>>>>> Here are some results of testing your tighten_casting branch on a few >>>>>> projects - no need to first put it in master first to do that. Four failures >>>>>> in numpy, two in scipy, four in scikit-learn (plus two that don't look >>>>>> related), none in scikits.statsmodels. I didn't check how many of them are >>>>>> actual bugs. >>>>>> >>>>>> I'm not against trying out your change, but it would probably be good >>>>>> to do some more testing first and fix the issues found before putting it in. >>>>>> Then at least if people run into issues with the already tested packages, >>>>>> you can just tell them to update those to latest master. >>>>>> >>>>> >>>>> Cool, thanks for running those. I already took a chunk out of the NumPy >>>>> failures. The ones_like function shouldn't really be a ufunc, but rather be >>>>> like zeros_like and empty_like, but that's probably not something to change >>>>> right now. The datetime-fixes type resolution change provides a mechanism to >>>>> fix that up pretty easily. >>>>> >>>>> For Scipy, what do you think is the best way to resolve it? 
If NumPy >>>>> 1.6 is the minimum version for the next scipy, I would add casting='unsafe' >>>>> to the failing sqrt call. >>>>> >>>> >>>> >>> There's no reason to set the minimum required numpy to 1.6 AFAIK, and >>> it's definitely not desirable. >>> >> >> Ok, I think there are two ways to resolve this kind of error. One would be >> to add a condition testing for a version >= 1.6, and setting >> casting='unsafe' when that occurs, and the other would be to insert a call >> to .astype() to force the type. The former is probably preferable, to avoid >> the unnecessary copy. >> >> Does numpy provide the version in tuple form, so a version comparison like >> this can be done easily? numpy.version doesn't seem to have one in it. >> >> No, just as a string. I think it's fine to do: > > In [4]: np.version.short_version > Out[4]: '2.0.0' > In [5]: np.version.short_version > '1.5.1' > Out[5]: True > > For a very convoluted version that produces a tuple see > parse_numpy_version() in scipy.pavement.py > I've submitted a pull request to scipy which fixes up the ndimage errors, as well as a netcdf bug. I think it would be a good idea to add something like np.version.short_version_tuple, at least before we get to numpy 10.0.0 or 1.10.0, where this comparison will break down. -Mark > Ralf > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Wed Jun 15 18:16:25 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Thu, 16 Jun 2011 00:16:25 +0200 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: References: Message-ID: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> On 15.06.2011, at 1:34AM, Mark Wiebe wrote: > These functions are now fully implemented and documented. As always, code reviews are welcome here: > > https://github.com/numpy/numpy/pull/87 > > and for those that don't want to dig into review C code, the commit for the documentation is here: > > https://github.com/m-paradox/numpy/commit/6b5a42a777b16812e774193b06da1b68b92bc689 > > This is probably also another good place to do a merge to master, so if people could test it on Mac/Windows/other platforms that would be much appreciated. > Probaby not specifically related to the business day, but to this month's general datetime fix-up, under Python3 there are a number of test failures of the kinds quoted below. You can review a fix (plus a number of additional test extensions and proposed fix for other failures on master) at https://github.com/dhomeier/numpy/compare/master...dt_tests_2to3 Plus on Mac OS X 10.5 PPC there is a failure due to the fact that the new datetime64 appears to _always_ return 0 for the year part on this architecture: >>> date='1970-03-23 20:00:00Z' >>> np.datetime64(date) np.datetime64('0000-03-23T21:00:00+0100','s') >>> np.datetime64(date, 'Y') np.datetime64('0000','Y') >>> np.datetime64('2011-06-16 02:03:04Z', 'D') np.datetime64('0000-06-16','D') I've tried to track this down in datetime.c, but unsuccessfully so (i.e. I could not connect it to any of the dts->year assignments therein). For the record, all tests pass with Python2.5-2.7 on OS X 10.5/10.6 i386 and x86_64. 
Cheers, Derek FAIL: test_datetime_as_string (test_datetime.TestDateTime) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 970, in test_datetime_as_string '1959') File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: b'0000' DESIRED: '1959' ====================================================================== FAIL: test_days_creation (test_datetime.TestDateTime) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 287, in test_days_creation (1900-1970)*365 - (1970-1900)/4) File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", line 256, in assert_equal return assert_array_equal(actual, desired, err_msg, verbose) File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", line 706, in assert_array_equal verbose=verbose, header='Arrays are not equal') File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", line 635, in assert_array_compare raise AssertionError(msg) AssertionError: Arrays are not equal (mismatch 100.0%) x: array(-25567, dtype='int64') y: array(-25567.5) From jonathan.taylor at utoronto.ca Wed Jun 15 19:21:15 2011 From: jonathan.taylor at utoronto.ca (Jonathan Taylor) Date: Wed, 15 Jun 2011 19:21:15 -0400 Subject: [Numpy-discussion] Behaviour of ndarray and other objects with __radd__? In-Reply-To: References: Message-ID: Ok. I suspected it was something of that sort and that certainly makes sense to have this ability. Can we perhaps add this to the documentation/specification so that we can more confident that this behavior remains into the future? Thanks, Jonathan. On Wed, Jun 15, 2011 at 11:46 AM, Charles R Harris wrote: > > > On Wed, Jun 15, 2011 at 9:34 AM, Olivier Delalleau wrote: >> >> I don't really understand this behavior either, but juste note that >> according to >> http://docs.scipy.org/doc/numpy/user/c-info.beyond-basics.html >> "This attribute can also be defined by objects that are not sub-types of >> the ndarray" >> >> -=- Olivier >> >> 2011/6/15 Jonathan Taylor >>> >>> Hi, >>> >>> I would like to have objects that I can mix with ndarrays in >>> arithmetic expressions but I need my object to have control of the >>> operation even when it is on the right hand side of the equation. ?I >>> realize from the documentation that the way to do this is to actually >>> subclass ndarray but this is undesirable because I do not need all the >>> heavy machinery of a ndarray and I do not want users to see all of the >>> ndarray methods. ?Is there a way to somehow achieve these goals? >>> >>> I would also very much appreciate some clarification of what is >>> happening in the following basic example: >>> >>> import numpy as np >>> class Foo(object): >>> ? ?# THE NEXT LINE IS COMMENTED >>> ? ?# __array_priority__ = 0 >>> ? ?def __add__(self, other): >>> ? ? ? ?print 'Foo has control over', other >>> ? ? ? ?return 1 >>> ? ?def __radd__(self, other): >>> ? ? ? ?print 'Foo has control over', other >>> ? ? ? 
?return 1 >>> >>> x = np.arange(3) >>> f = Foo() >>> >>> print f + x >>> print x + f >>> >>> yields >>> >>> Foo has control over [0 1 2] >>> 1 >>> Foo has control over 0 >>> Foo has control over 1 >>> Foo has control over 2 >>> [1 1 1] >>> >>> I see that I have control from the left side as expected and I suspect >>> that what is happening in the second case is that numpy is trying to >>> "broadcast" my object onto the left side as if it was an object array? >>> >>> Now if I uncomment the line __array_priority__ = 0 I do seem to >>> accomplish my goals (see below) but I am not sure why. ?I am >>> surprised, given what I have read in the documentation, that >>> __array_priority__ does anything in a non subclass of ndarray. >>> Furthermore, I am even more surprised that it does anything when it is >>> 0, which is the same as ndarray.__array_priority__ from what I >>> understand. ?Any clarification of this would be greatly appreciated. >>> > > There's a bit of code in the ufunc implementation that checks for the > __array_priority__ attribute regardless of if the object subclasses ndarray, > probably because someone once needed to solve the same problem you are > having. The comment that goes with it is > > ??? /* > ???? * FAIL with NotImplemented if the other object has > ???? * the __r__ method and has __array_priority__ as > ???? * an attribute (signalling it can handle ndarray's) > ???? * and is not already an ndarray or a subtype of the same type. > ??? */ > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From sturla at molden.no Wed Jun 15 20:09:46 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 02:09:46 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DF9229D.7080002@noaa.gov> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> Message-ID: <4DF949CA.9070901@molden.no> Den 15.06.2011 23:22, skrev Christopher Barker: > > It would also would be great if someone that actually understands this > stuff could look at his code and explain why the slowdown occurs (hint, > hint!) > Not sure I qualify, but I think I notice several potential problems in the OP's multiprocessing/NumPy code: "innerProductList = pool.map(myutil.numpy_inner_product, arrayList)" 1. Here we potentially have a case of false sharing and/or mutex contention, as the work is too fine grained. pool.map does not do any load balancing. If pool.map is to scale nicely, each work item must take a substantial amount of time. I suspect this is the main issue. 2. There is also the question of when the process pool is spawned. Though I haven't checked, I suspect it happens prior to calling pool.map. But if it does not, this is a factor as well, particularly on Windows (less so on Linux and Apple). 3. "arrayList" is serialised by pickling, which has a significan overhead. It's not shared memory either, as the OP's code implies, but the main thing is the slowness of cPickle. "IPs = N.array(innerProductList)" 4. numpy.array is a very slow function. The benchmark should preferably not include this overhead. 
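To make point 1 concrete, one workaround is to hand each worker a sizeable chunk of arrays rather than a single small one; a hedged sketch (array shapes and names invented, and the pickling cost from point 3 still applies):

import numpy as np
from multiprocessing import Pool

def chunk_inner_products(arrays):
    # One task = a whole chunk of arrays, so each task does real work.
    return [np.sum(a * a) for a in arrays]

if __name__ == '__main__':
    array_list = [np.random.rand(50, 50) for _ in range(4000)]
    n_proc = 4
    # One sizeable chunk per process instead of one tiny task per array.
    chunks = [array_list[i::n_proc] for i in range(n_proc)]
    pool = Pool(processes=n_proc)
    results = pool.map(chunk_inner_products, chunks)
    pool.close()
    pool.join()
    inner_products = np.concatenate(results)

pool.map also accepts a chunksize argument, which achieves much the same effect without reorganizing the input list.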
Sturla From sturla at molden.no Wed Jun 15 20:25:13 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 02:25:13 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DF9229D.7080002@noaa.gov> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> Message-ID: <4DF94D69.3080802@molden.no> Den 15.06.2011 23:22, skrev Christopher Barker: > > I think the issue got confused -- the OP was not looking to speed up a > matrix multiply, but rather to speed up a whole bunch of independent > matrix multiplies. I would do it like this: 1. Write a Fortran function that make multiple calls DGEMM in a do loop. (Or Fortran intrinsics dot_product or matmul.) 2. Put an OpenMP pragma around the loop (!$omp parallel do). Invoke the OpenMP compiler on compilation. Use static or guided thread scheduling. 3. Call Fortran from Python using f2py, ctypes or Cython. Build with a thread-safe and single-threaded BLAS library. That should run as fast as it gets. Sturla From ed at pythoncharmers.com Thu Jun 16 02:56:17 2011 From: ed at pythoncharmers.com (Ed Schofield) Date: Thu, 16 Jun 2011 16:56:17 +1000 Subject: [Numpy-discussion] Regression in choose() Message-ID: Hi all, I have been investigation the limitation of the choose() method (and function) to 32 elements. This is a regression in recent versions of NumPy. I have tested choose() in the following NumPy versions: 1.0.4: fine 1.1.1: bug 1.2.1: fine 1.3.0: bug 1.4.x: bug 1.5.x: bug 1.6.x: bug Numeric 24.3: fine (To run the tests on versions of NumPy prior to 1.4.x I used Python 2.4.3. For the other tests I used Python 2.7.) Here 'bug' means the choose() function has the 32-element limitation. I have been helping an organization to port a large old Numeric-using codebase to NumPy, and the choose() limitation in recent NumPy versions is throwing a spanner in the works. The codebase is currently using both NumPy and Numeric side-by-side, with Numeric only being used for its choose() function, with a few dozen lines like this: a = numpy.array(Numeric.choose(b, c)) Here is a simple example that triggers the bug. It is a simple extension of the example from the choose() docstring: ---------------- import numpy as np choices = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]] np.choose([2, 3, 1, 0], choices * 8) ---------------- A side note: the exception message (defined in core/src/multiarray/iterators.c) is also slightly inconsistent with the actual behaviour: Traceback (most recent call last): File "chooser.py", line 6, in np.choose([2, 3, 1, 0], choices * 8) File "/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py", line 277, in choose return _wrapit(a, 'choose', choices, out=out, mode=mode) File "/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py", line 37, in _wrapit result = getattr(asarray(obj),method)(*args, **kwds) ValueError: Need between 2 and (32) array objects (inclusive). The actual behaviour is that choose() passes with 31 objects but fails with 32 objects, so this should read "exclusive" rather than "inclusive". (And why the parentheses around 32?) Does anyone know what changed between 1.2.1 and 1.3.0 that introduced the 32-element limitation to choose(), and whether we might be able to lift this limitation again for future NumPy versions? I have a couple of days to work on a patch ... if someone can advise me how to approach this. Best wishes, Ed -- Dr. 
Edward Schofield Python Charmers +61 (0)405 676 229 http://pythoncharmers.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Thu Jun 16 06:45:53 2011 From: shish at keba.be (Olivier Delalleau) Date: Thu, 16 Jun 2011 06:45:53 -0400 Subject: [Numpy-discussion] Regression in choose() In-Reply-To: References: Message-ID: If performance is not critical you could just write your own function for a quick fix, doing something like numpy.array([choices[j][i] for i, j in enumerate([2, 3, 1, 0])]) This 32-array limitation definitely looks weird to me though, doesn't seem to make sense. -=- Olivier 2011/6/16 Ed Schofield > Hi all, > > I have been investigation the limitation of the choose() method (and > function) to 32 elements. This is a regression in recent versions of NumPy. > I have tested choose() in the following NumPy versions: > > 1.0.4: fine > 1.1.1: bug > 1.2.1: fine > 1.3.0: bug > 1.4.x: bug > 1.5.x: bug > 1.6.x: bug > Numeric 24.3: fine > > (To run the tests on versions of NumPy prior to 1.4.x I used Python 2.4.3. > For the other tests I used Python 2.7.) > > Here 'bug' means the choose() function has the 32-element limitation. I > have been helping an organization to port a large old Numeric-using codebase > to NumPy, and the choose() limitation in recent NumPy versions is throwing a > spanner in the works. The codebase is currently using both NumPy and Numeric > side-by-side, with Numeric only being used for its choose() function, with a > few dozen lines like this: > > a = numpy.array(Numeric.choose(b, c)) > > Here is a simple example that triggers the bug. It is a simple extension of > the example from the choose() docstring: > > ---------------- > > import numpy as np > > choices = [[0, 1, 2, 3], [10, 11, 12, 13], > [20, 21, 22, 23], [30, 31, 32, 33]] > > np.choose([2, 3, 1, 0], choices * 8) > > ---------------- > > A side note: the exception message (defined in > core/src/multiarray/iterators.c) is also slightly inconsistent with the > actual behaviour: > > Traceback (most recent call last): > File "chooser.py", line 6, in > np.choose([2, 3, 1, 0], choices * 8) > File "/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py", line > 277, in choose > return _wrapit(a, 'choose', choices, out=out, mode=mode) > File "/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py", line > 37, in _wrapit > result = getattr(asarray(obj),method)(*args, **kwds) > ValueError: Need between 2 and (32) array objects (inclusive). > > The actual behaviour is that choose() passes with 31 objects but fails with > 32 objects, so this should read "exclusive" rather than "inclusive". (And > why the parentheses around 32?) > > Does anyone know what changed between 1.2.1 and 1.3.0 that introduced the > 32-element limitation to choose(), and whether we might be able to lift this > limitation again for future NumPy versions? I have a couple of days to work > on a patch ... if someone can advise me how to approach this. > > Best wishes, > Ed > > > -- > Dr. Edward Schofield > Python Charmers > +61 (0)405 676 229 > http://pythoncharmers.com > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tmp50 at ukr.net Thu Jun 16 06:59:28 2011 From: tmp50 at ukr.net (Dmitrey) Date: Thu, 16 Jun 2011 13:59:28 +0300 Subject: [Numpy-discussion] [ANN] OpenOpt suite 0.34 Message-ID: Hi all, I'm glad to inform you about new quarterly release 0.34 of the OOSuite package software (OpenOpt, FuncDesigner, SpaceFuncs, DerApproximator) . Main changes: * Python 3 compatibility * Lots of improvements and speedup for interval calculations * Now interalg can obtain all solutions of nonlinear equation (example) or systems of them (example) in the involved box lb_i <= x_i <= ub_i (bounds can be very large), possibly constrained (e.g. sin(x) + cos(y+x) > 0.5). * Many other improvements and speedup for interalg. See http://forum.openopt.org/viewtopic.php?id=425 for more details. Regards, D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ckkart at hoc.net Thu Jun 16 07:28:44 2011 From: ckkart at hoc.net (Christian K.) Date: Thu, 16 Jun 2011 11:28:44 +0000 (UTC) Subject: [Numpy-discussion] 3d ODR Message-ID: Hi, I need to do fit a 3d surface to a point cloud. This sounds like a job for 3d orthogonal distance regression. Does anybody know of an implementation? Regards, Christian K. From michael at klitgaard.dk Thu Jun 16 09:02:13 2011 From: michael at klitgaard.dk (Michael Klitgaard) Date: Thu, 16 Jun 2011 15:02:13 +0200 Subject: [Numpy-discussion] Using numpy.fromfile with structured array skipping some elements. In-Reply-To: References: Message-ID: Hello, I hope this is the right mailings list for a numpy user questions, if not, I'm sorry. Im reading a binary file with numpy.fromfile() The binary file is constructed with some integers for checking the data for corruption. This is how the binary file is constructed: Timestamp ? ? ? [ 12 bytes] [ 1 int ] ? ? ? check [ 1 single ] ? ?Time stamp (single precision). [ 1 int ] ? ? ? check Data chunk ? ? ?[ 4*(no_sensor+2) bytes ] [ 1 int ] ? ? ? check [ no_sensor single ] ? ?Array of sensor readings (single precision). [ 1 int ] ? ? ? check The file continues this way [ Timestamp ] [ Data chunk ] [ Timestamp ] [ Data chunk ] .. no_sensor is file dependend int = 4 bytes single = 4 bytes This is my current procedure f = open(file,'rb') f.read(size_of_header) ?# The file contains a header, where fx. the no_sensor can be read. dt = np.dtype([('junk0', 'i4'), ? ? ? ? ? ? ? ('timestamp', 'f4'), ? ? ? ? ? ? ? ('junk1', 'i4'), ? ? ? ? ? ? ? ('junk2', 'i4'), ? ? ? ? ? ? ? ('sensors', ('f4',no_sensor)), ? ? ? ? ? ? ? ('junk3', 'i4')]) data = np.fromfile(f, dtype=dt) Now the data is read in and I can access it, but I have the 'junk' in the array, which annoys me. Is there a way to remove the junk data, or skip it with fromfile ? Another issue is that when accessing one sensor, I do it this way: data['sensors'][:,0] for the first sensor, would it be possible to just do: data['sensors'][0] ? Thank you! Sincerely Michael Klitgaard From e.antero.tammi at gmail.com Thu Jun 16 09:56:25 2011 From: e.antero.tammi at gmail.com (eat) Date: Thu, 16 Jun 2011 16:56:25 +0300 Subject: [Numpy-discussion] 3d ODR In-Reply-To: References: Message-ID: Hi, On Thu, Jun 16, 2011 at 2:28 PM, Christian K. wrote: > Hi, > > I need to do fit a 3d surface to a point cloud. This sounds like a job for > 3d > orthogonal distance regression. Does anybody know of an implementation? > How bout http://docs.scipy.org/doc/scipy/reference/odr.htm My 2 cents, eat > > Regards, Christian K. 
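To expand on the scipy.odr pointer above, here is a rough sketch of an implicit orthogonal-distance fit of a surface to a 3D point cloud; the surface is a sphere purely for illustration, and the exact scipy.odr calls should be checked against the docs before relying on them:

import numpy as np
from scipy.odr import Data, Model, ODR

def sphere(beta, x):
    # Implicit surface: (x - x0)**2 + (y - y0)**2 + (z - z0)**2 - r**2 = 0
    x0, y0, z0, r = beta
    return (x[0] - x0)**2 + (x[1] - y0)**2 + (x[2] - z0)**2 - r**2

# Synthetic noisy point cloud on a sphere of radius 2.
rng = np.random.RandomState(0)
theta, phi = rng.uniform(0, np.pi, 500), rng.uniform(0, 2 * np.pi, 500)
pts = 2.0 * np.vstack([np.sin(theta) * np.cos(phi),
                       np.sin(theta) * np.sin(phi),
                       np.cos(theta)])
pts = pts + 0.05 * rng.standard_normal(pts.shape)

data = Data(pts, y=1)                  # an integer y marks an implicit model
model = Model(sphere, implicit=True)
fit = ODR(data, model, beta0=[0.1, 0.1, 0.1, 1.0]).run()
print(fit.beta)                        # fitted centre (x0, y0, z0) and radius r

For a general freeform surface the same pattern applies with a different implicit model function and starting parameters.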
> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsouthey at gmail.com Thu Jun 16 10:16:16 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 16 Jun 2011 09:16:16 -0500 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DF94D69.3080802@molden.no> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> Message-ID: <4DFA1030.7070403@gmail.com> On 06/15/2011 07:25 PM, Sturla Molden wrote: > Den 15.06.2011 23:22, skrev Christopher Barker: >> I think the issue got confused -- the OP was not looking to speed up a >> matrix multiply, but rather to speed up a whole bunch of independent >> matrix multiplies. > I would do it like this: > > 1. Write a Fortran function that make multiple calls DGEMM in a do loop. > (Or Fortran intrinsics dot_product or matmul.) > > 2. Put an OpenMP pragma around the loop (!$omp parallel do). Invoke the > OpenMP compiler on compilation. Use static or guided thread scheduling. > > 3. Call Fortran from Python using f2py, ctypes or Cython. > > Build with a thread-safe and single-threaded BLAS library. > > That should run as fast as it gets. > > Sturla > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion The idea was to calculate: innerProduct =numpy.sum(array1*array2) where array1 and array2 are, in general, multidimensional. Numpy has an inner product function called np.inner (http://www.scipy.org/Numpy_Example_List_With_Doc#inner) which is a special case of tensordot. See the documentation for what is does and other examples. Also note that np.inner provides of the cross-products as well. For example, you can just do: >>> import numpy as np >>> a=np.arange(10).reshape(2,5) >>> a array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]) >>> b=a*10 >>> b array([[ 0, 10, 20, 30, 40], [50, 60, 70, 80, 90]]) >>> c=a*b >>> c array([[ 0, 10, 40, 90, 160], [250, 360, 490, 640, 810]]) >>> c.sum() 2850 >>>d=np.inner(a,b) >>>d array([[ 300, 800], [ 800, 2550]]) >>>np.diag(d).sum() 2850 I do not know if numpy's multiplication, np.inner() and np.sum() are single-threaded and thread-safe when you link to multi-threaded blas, lapack or altas libraries. But if everything is *single-threaded* and thread-safe, then you just create a function and use Anne's very useful handythread.py (http://www.scipy.org/Cookbook/Multithreading). Otherwise you would have to determine the best approach such doing nothing (especially if the arrays are huge), or making the functions single-thread. By the way, if the arrays are sufficiently small, there is a lot of overhead involved such that there is more time in communication than computation. So you have to determine the best algorithm for you case. Bruce From ben.root at ou.edu Thu Jun 16 10:18:46 2011 From: ben.root at ou.edu (Benjamin Root) Date: Thu, 16 Jun 2011 09:18:46 -0500 Subject: [Numpy-discussion] datetimes with date vs time units, local time, and time zones In-Reply-To: References: Message-ID: On Wednesday, June 15, 2011, Mark Wiebe wrote: > Towards a reasonable behavior with regard to local times, I've made the default repr for datetimes use the C standard library to print them in a local ISO format. 
Combined with the ISO8601-prescribed behavior of interpreting datetime strings with no timezone specifier to be in local times, this allows the following cases to behave reasonably: > >>>> np.datetime64('now')numpy.datetime64('2011-06-15T15:16:51-0500','s') >>>> np.datetime64('2011-06-15T18:00') > numpy.datetime64('2011-06-15T18:00-0500','m') > As noted in another thread, there can be some extremely surprising behavior as a consequence: > >>>> np.array(['now', '2011-06-15'], dtype='M')array(['2011-06-15T15:18:26-0500', '2011-06-14T19:00:00-0500'], dtype='datetime64[s]') > Having the 15th of June print out as 7pm on the 14th of June is probably not what one would generally expect, so I've come up with an approach which hopefully deals with this in a good way. > > One firm principal of the datetime in NumPy is that it is always stored as a POSIX time (referencing UTC), or a TAI time. There are two categories of units that can be used, which I will call date units?and time units. The date units are 'Y', 'M', 'W', and 'D', while the time units are 'h', 'm', 's', ..., 'as'. Time zones are only applied to datetimes stored in time units, so there's a qualitative difference between date and time units with respect to string conversions and calendar operations. > > I would like to place an 'unsafe' casting barrier between the date units and the time units, so that the above conversion from a date into a datetime will raise an error instead of producing a confusing result. This only applies to datetimes and not timedeltas, because for timedeltas the day <-> hour case is fine, it is just the year/month <-> other units which has issues, and that is already treated with an 'unsafe' casting barrier. > > Two new functions will facilitate the conversions between datetimes with date units and time units: > date_as_datetime(datearray, hour, minute, second, microsecond, timezone='local', unit=None, out=None), which converts the provided dates into datetimes at the specified time, according to the specified timezone. If 'unit' is specified, it controls the output unit, otherwise it is the units in 'out' or the amount of precision specified in the function. > > datetime_as_date(datetimearray, timezone='local', out=None), which converts the provided datetimes into dates according to the specified timezone. > In both functions, timezone can be any of 'UTC', 'TAI', 'local', '+/-####', or a datetime.tzinfo object. The latter will allow NumPy datetimes to work with the pytz library for flexible time zone support. > > I would also like to extend the 'today' input string parsing to accept strings like 'today 12:30' to allow a convenient way to express different local times occurring today, mostly useful for interactive usage. > > I welcome any comments on this design, particularly if you can find a case where this doesn't produce a reasonable behavior. > Cheers,Mark > Is the output for the given usecase above with the mix of 'now' and a datetime string without tz info intended to still be correct? I personally have misgivings about interpreating phrases like "now" and "today" at this level. I think it introduces a can of worms that would be difficult to handle. Consider some arbitrary set of inputs to the array function for datetime objects. If they all contain no tz info, then they are all interpreated the same as-is. However, if even one element has 'now', then the inputs are interpreated entirely differently. This will confuse people. 
Just thinking out loud here, What about a case where the inputs are such that some do not specify tz and some others specify a mix of timezones? Should that be any different from the case given above? It has been awhile for me, but how different is this from Perl's floating tz for its datetime module? Maybe we could combine its approach with your "unsafe" barrier for the ambiguous situations that perl's datetime module mentions? Ben Root From sturla at molden.no Thu Jun 16 10:23:40 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 16:23:40 +0200 Subject: [Numpy-discussion] Using numpy.fromfile with structured array skipping some elements. In-Reply-To: References: Message-ID: <4DFA11EC.8010101@molden.no> Den 16.06.2011 15:02, skrev Michael Klitgaard: > Now the data is read in and I can access it, but I have the 'junk' in > the array, which annoys me. > Is there a way to remove the junk data, or skip it with fromfile ? Not that I know of, unless you are ok with a loop (either in Python or C). If it's that important (e.g. to save physical memory), you could also memory map the file and never reference the junk data. Sturla From sturla at molden.no Thu Jun 16 10:30:16 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 16:30:16 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA1030.7070403@gmail.com> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> Message-ID: <4DFA1378.4050804@molden.no> Den 16.06.2011 16:16, skrev Bruce Southey: > The idea was to calculate: > innerProduct =numpy.sum(array1*array2) where array1 and array2 are, in > general, multidimensional. If you want to do that in parallel, fast an with minimal overhead, it's a typical OpenMP problem. If the dimensions are unknown in advance, I'd probably prefer a loop in C instead of Fortran and BLAS. Sturla From bsouthey at gmail.com Thu Jun 16 10:51:39 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 16 Jun 2011 09:51:39 -0500 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA1378.4050804@molden.no> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> Message-ID: <4DFA187B.7010608@gmail.com> On 06/16/2011 09:30 AM, Sturla Molden wrote: > Den 16.06.2011 16:16, skrev Bruce Southey: >> The idea was to calculate: >> innerProduct =numpy.sum(array1*array2) where array1 and array2 are, in >> general, multidimensional. > If you want to do that in parallel, fast an with minimal overhead, it's > a typical OpenMP problem. If the dimensions are unknown in advance, I'd > probably prefer a loop in C instead of Fortran and BLAS. > > Sturla > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion Remember that is what the OP wanted to do, not me. My issue with this is that this suggestion may be overkill and too complex for the OP without knowing if there is 'sufficient' performance gain (let alone gaining back the time spent ensuring that code is correct and thread-safe as possible). 
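Coming back to the fromfile question earlier in this digest, a small sketch of the memory-map route Sturla mentions; the file name, header size and sensor count below are placeholders for whatever the real header specifies, and the junk fields simply stay in the dtype without ever being referenced:

import numpy as np

size_of_header = 1024    # placeholder: taken from the real file header
no_sensor = 16           # placeholder: taken from the real file header

dt = np.dtype([('junk0', 'i4'),
               ('timestamp', 'f4'),
               ('junk1', 'i4'),
               ('junk2', 'i4'),
               ('sensors', ('f4', no_sensor)),
               ('junk3', 'i4')])

# Map the records without loading the whole file; junk fields are never touched.
data = np.memmap('datafile.bin', dtype=dt, mode='r', offset=size_of_header)

timestamps = np.array(data['timestamp'])   # contiguous in-memory copy, junk dropped
sensors = np.array(data['sensors'])        # shape (nrecords, no_sensor)
first_sensor = sensors[:, 0]               # one sensor across all records

The junk stays in the file layout, but it never shows up in the arrays actually being worked with, and the explicit np.array() calls give compact copies if the strided views are not wanted.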
Bruce From mwwiebe at gmail.com Thu Jun 16 11:27:07 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 16 Jun 2011 10:27:07 -0500 Subject: [Numpy-discussion] datetimes with date vs time units, local time, and time zones In-Reply-To: References: Message-ID: On Thu, Jun 16, 2011 at 9:18 AM, Benjamin Root wrote: > On Wednesday, June 15, 2011, Mark Wiebe wrote: > > Towards a reasonable behavior with regard to local times, I've made the > default repr for datetimes use the C standard library to print them in a > local ISO format. Combined with the ISO8601-prescribed behavior of > interpreting datetime strings with no timezone specifier to be in local > times, this allows the following cases to behave reasonably: > > > >>>> np.datetime64('now')numpy.datetime64('2011-06-15T15:16:51-0500','s') > >>>> np.datetime64('2011-06-15T18:00') > > numpy.datetime64('2011-06-15T18:00-0500','m') > > As noted in another thread, there can be some extremely surprising > behavior as a consequence: > > > >>>> np.array(['now', '2011-06-15'], > dtype='M')array(['2011-06-15T15:18:26-0500', '2011-06-14T19:00:00-0500'], > dtype='datetime64[s]') > > Having the 15th of June print out as 7pm on the 14th of June is probably > not what one would generally expect, so I've come up with an approach which > hopefully deals with this in a good way. > > > > One firm principal of the datetime in NumPy is that it is always stored > as a POSIX time (referencing UTC), or a TAI time. There are two categories > of units that can be used, which I will call date units and time units. The > date units are 'Y', 'M', 'W', and 'D', while the time units are 'h', 'm', > 's', ..., 'as'. Time zones are only applied to datetimes stored in time > units, so there's a qualitative difference between date and time units with > respect to string conversions and calendar operations. > > > > I would like to place an 'unsafe' casting barrier between the date units > and the time units, so that the above conversion from a date into a datetime > will raise an error instead of producing a confusing result. This only > applies to datetimes and not timedeltas, because for timedeltas the day <-> > hour case is fine, it is just the year/month <-> other units which has > issues, and that is already treated with an 'unsafe' casting barrier. > > > > Two new functions will facilitate the conversions between datetimes with > date units and time units: > > date_as_datetime(datearray, hour, minute, second, microsecond, > timezone='local', unit=None, out=None), which converts the provided dates > into datetimes at the specified time, according to the specified timezone. > If 'unit' is specified, it controls the output unit, otherwise it is the > units in 'out' or the amount of precision specified in the function. > > > > datetime_as_date(datetimearray, timezone='local', out=None), which > converts the provided datetimes into dates according to the specified > timezone. > > In both functions, timezone can be any of 'UTC', 'TAI', 'local', > '+/-####', or a datetime.tzinfo object. The latter will allow NumPy > datetimes to work with the pytz library for flexible time zone support. > > > > I would also like to extend the 'today' input string parsing to accept > strings like 'today 12:30' to allow a convenient way to express different > local times occurring today, mostly useful for interactive usage. > > > > I welcome any comments on this design, particularly if you can find a > case where this doesn't produce a reasonable behavior. 
> > Cheers,Mark > > > > Is the output for the given usecase above with the mix of 'now' and a > datetime string without tz info intended to still be correct? No, that case would fail. The resolution of 'now' is seconds, and the resolution of a date string is days, so the case would require a conversion across the date unit/time unit boundary. > I personally have misgivings about interpreating phrases like "now" and > "today" at this level. I think it introduces a can of worms that > would be difficult to handle. > I like the convenience it gives at the interactive prompt, but maybe a datetime_from_string function where you can selectively enable/disable allowing of these special values and local times can provide control over this. This is similar to the datetime_as_string function which gives more flexibility than simple conversion to a string. Consider some arbitrary set of inputs to the array function for > datetime objects. If they all contain no tz info, then they are all > interpreated the same as-is. However, if even one element has 'now', > then the inputs are interpreated entirely differently. This will > confuse people. > The element 'now' has no effect on the other inputs, except to possibly promote the unit to a seconds level of precision. All datetimes are in UTC, and when timezone information is given, that is only used for parsing the input, it is not preserved. Just thinking out loud here, What about a case where the inputs are > such that some do not specify tz and some others specify a mix of > timezones? Should that be any different from the case given above? > I think this has the same answer, everything gets converted to UTC. It has been awhile for me, but how different is this from Perl's > floating tz for its datetime module? Maybe we could combine its > approach with your "unsafe" barrier for the ambiguous situations that > perl's datetime module mentions? > I'd rather not attach timezone information to the numpy datetime, the pytz library appears to already support this kind of thing, and I see no reason to duplicate that effort, but rather support the pytz timezone objects in certain datetime manipulation routines. Cheers, Mark > > Ben Root > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 16 11:40:59 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 16 Jun 2011 10:40:59 -0500 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> Message-ID: On Wed, Jun 15, 2011 at 5:16 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > On 15.06.2011, at 1:34AM, Mark Wiebe wrote: > > > These functions are now fully implemented and documented. As always, code > reviews are welcome here: > > > > https://github.com/numpy/numpy/pull/87 > > > > and for those that don't want to dig into review C code, the commit for > the documentation is here: > > > > > https://github.com/m-paradox/numpy/commit/6b5a42a777b16812e774193b06da1b68b92bc689 > > > > This is probably also another good place to do a merge to master, so if > people could test it on Mac/Windows/other platforms that would be much > appreciated. 
> > > Probaby not specifically related to the business day, but to this month's > general datetime fix-up, > under Python3 there are a number of test failures of the kinds quoted > below. > You can review a fix (plus a number of additional test extensions and > proposed fix for other > failures on master) at > https://github.com/dhomeier/numpy/compare/master...dt_tests_2to3 Thanks for the testing and proposed fixes, I've added some comments to the commits. > Plus on Mac OS X 10.5 PPC there is a failure due to the fact that the new > datetime64 appears > to _always_ return 0 for the year part on this architecture: > > >>> date='1970-03-23 20:00:00Z' > >>> np.datetime64(date) > np.datetime64('0000-03-23T21:00:00+0100','s') > >>> np.datetime64(date, 'Y') > np.datetime64('0000','Y') > >>> np.datetime64('2011-06-16 02:03:04Z', 'D') > np.datetime64('0000-06-16','D') > > I've tried to track this down in datetime.c, but unsuccessfully so (i.e. I > could not connect it > to any of the dts->year assignments therein). > This is definitely perplexing. Probably the first thing to check is whether it's breaking during parsing or printing. This should always produce the same result: >>> np.datetime64('1970-03-23 20:00:00Z').astype('i8') 7070400 But maybe the test_days_creation is already checking that thoroughly enough. Then, maybe printf-ing the year value at various stages of the printing, like in set_datetimestruct_days, after convert_datetime_to_datetimestruct, and in make_iso_8601_date. This would at least isolate where the year is getting lost. For the record, all tests pass with Python2.5-2.7 on OS X 10.5/10.6 i386 and > x86_64. > Great! Cheers, Mark > Cheers, > Derek > > > FAIL: test_datetime_as_string (test_datetime.TestDateTime) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_datetime.py", > line 970, in test_datetime_as_string > '1959') > File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", > line 313, in assert_equal > raise AssertionError(msg) > AssertionError: > Items are not equal: > ACTUAL: b'0000' > DESIRED: '1959' > > ====================================================================== > FAIL: test_days_creation (test_datetime.TestDateTime) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_datetime.py", > line 287, in test_days_creation > (1900-1970)*365 - (1970-1900)/4) > File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", > line 256, in assert_equal > return assert_array_equal(actual, desired, err_msg, verbose) > File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", > line 706, in assert_array_equal > verbose=verbose, header='Arrays are not equal') > File "/Users/derek/lib/python3.2/site-packages/numpy/testing/utils.py", > line 635, in assert_array_compare > raise AssertionError(msg) > AssertionError: > Arrays are not equal > > (mismatch 100.0%) > x: array(-25567, dtype='int64') > y: array(-25567.5) > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
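A quick stdlib cross-check of the expected number, for whoever is debugging this on the PPC box: if .astype('i8') already gives the value below while the repr still shows year 0000, the problem is isolated to the ISO string formatting (make_iso_8601_date and friends) rather than the parser. Nothing here depends on NumPy:

import calendar
from datetime import datetime

# POSIX seconds for 1970-03-23 20:00:00 UTC, computed independently of numpy
expected = calendar.timegm(datetime(1970, 3, 23, 20, 0, 0).timetuple())
print(expected)   # 7070400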
URL: From robert.kern at gmail.com Thu Jun 16 12:39:16 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 16 Jun 2011 11:39:16 -0500 Subject: [Numpy-discussion] 3d ODR In-Reply-To: References: Message-ID: On Thu, Jun 16, 2011 at 06:28, Christian K. wrote: > Hi, > > I need to do fit a 3d surface to a point cloud. This sounds like a job for 3d > orthogonal distance regression. Does anybody know of an implementation? As eat points out, scipy.odr should do the job nicely. You should read at least part of the ODRPACK User's Guide in addition to the scipy docs for it: http://docs.scipy.org/doc/external/odrpack_guide.pdf The FORTRAN examples there have scipy.odr translations in the unit tests: https://github.com/scipy/scipy/blob/master/scipy/odr/tests/test_odr.py Exactly what you need to do depends on the details of your problem. If your surface is of the form z=f(x,y), that's a pretty straightforward multidimensional model. If you have a closed surface, however, you will need to use an implicit model of the form f(x,y,z)=0, which may or may not be easy to formulate. The unit test for the implicit case fits a general 2D ellipse to a 2D cloud of points, as described in the User's Guide. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From Chris.Barker at noaa.gov Thu Jun 16 12:44:05 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 16 Jun 2011 09:44:05 -0700 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA187B.7010608@gmail.com> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> Message-ID: <4DFA32D5.4050209@noaa.gov> NOTE: I'm only taking part in this discussion because it's interesting and I hope to learn something. I do hope the OP chimes back in to clarify his needs, but in the meantime... Bruce Southey wrote: > Remember that is what the OP wanted to do, not me. Actually, I don't think that's what the OP wanted -- I think we have a conflict between the need for concrete examples, and the desire to find a generic solution, so I think this is what the OP wants: How to best multiprocess a _generic_ operation that needs to be performed on a lot of arrays. Something like: output = [] for a in a_bunch_of_arrays: output.append( a_function(a) ) More specifically, a_function() is an inner product, *defined by the user*. So there is no way to optimize the inner product itself (that will be up to the user), nor any way to generally convert the bunch_of_arrays to a single array with a single higher-dimensional operation. In testing his approach, the OP used a numpy multiply, and a simple, loop-through-the elements multiply, and found that with his multiprocessing calls, the simple loop was a fair bit faster with two processors, but that the numpy one was slower with two processors. Of course, the looping method was much, much, slower than the numpy one in any case. So Sturla's comments are probably right on: Sturla Molden wrote: > "innerProductList = pool.map(myutil.numpy_inner_product, arrayList)" > > 1. Here we potentially have a case of false sharing and/or mutex > contention, as the work is too fine grained. pool.map does not do any > load balancing. 
If pool.map is to scale nicely, each work item must take > a substantial amount of time. I suspect this is the main issue. > 2. There is also the question of when the process pool is spawned. > Though I haven't checked, I suspect it happens prior to calling > pool.map. But if it does not, this is a factor as well, particularly on > Windows (less so on Linux and Apple). It didn't work well on my Mac, so ti's either not an issue, or not Windows-specific, anyway. > 3. "arrayList" is serialised by pickling, which has a significan > overhead. It's not shared memory either, as the OP's code implies, but > the main thing is the slowness of cPickle. I'll bet this is a big issue, and one I'm curious about how to address, I have another problem where I need to multi-process, and I'd love to know a way to pass data to the other process and back *without* going through pickle. maybe memmapped files? > "IPs = N.array(innerProductList)" > > 4. numpy.array is a very slow function. The benchmark should preferably > not include this overhead. I re-ran, moving that out of the timing loop, and, indeed, it helped a lot, but it still takes longer with the multi-processing. I suspect that the overhead of pickling, etc. is overwhelming the operation itself. That and the load balancing issue that I don't understand! To test this, I did a little experiment -- creating a "fake" operation, one that simply returns an element from the input array -- so it should take next to no time, and we can time the overhead of the pickling, etc: $ python shared_mem.py Using 2 processes No shared memory, numpy array multiplication took 0.124427080154 seconds Shared memory, numpy array multiplication took 0.586215019226 seconds No shared memory, fake array multiplication took 0.000391006469727 seconds Shared memory, fake array multiplication took 0.54935503006 seconds No shared memory, my array multiplication took 23.5055780411 seconds Shared memory, my array multiplication took 13.0932741165 seconds Bingo! The overhead of the multi-processing takes about .54 seconds, which explains the slowdown for the numpy method not so mysterious after all. Bruce Southey wrote: > But if everything is *single-threaded* and > thread-safe, then you just create a function and use Anne's very useful > handythread.py (http://www.scipy.org/Cookbook/Multithreading). This may be worth a try -- though the GIL could well get in the way. > By the way, if the arrays are sufficiently small, there is a lot of > overhead involved such that there is more time in communication than > computation. yup -- clearly the case here. I wonder if it's just array size though -- won't cPickle time scale with array size? So it may not be size pe-se, but rather how much computation you need for a given size array. -Chris [I've enclosed the OP's slightly altered code] -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- A non-text attachment was scrubbed... Name: myutil.py Type: application/x-python Size: 592 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
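A self-contained version of that overhead experiment, for anyone who wants to reproduce it without the attachments; the array sizes, list length and process count are arbitrary, and the "fake" worker exists only to expose the pickling-plus-dispatch cost of pool.map:

import time
import multiprocessing
import numpy as np

def fake_inner_product(arrays):
    # does no real work, so pool.map time is almost pure pickling/dispatch
    a1, a2 = arrays
    return a1[0, 0]

def numpy_inner_product(arrays):
    a1, a2 = arrays
    return np.sum(a1 * a2)

if __name__ == '__main__':
    arrayList = [(np.random.random((300, 200)), np.random.random((300, 200)))
                 for _ in range(50)]
    pool = multiprocessing.Pool(processes=2)
    for func in (fake_inner_product, numpy_inner_product):
        t0 = time.time()
        serial = list(map(func, arrayList))
        t1 = time.time()
        parallel = pool.map(func, arrayList)
        t2 = time.time()
        print('%s: loop %.3f s, pool.map %.3f s' % (func.__name__, t1 - t0, t2 - t1))
    pool.close()
    pool.join()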
Name: shared_mem.py Type: application/x-python Size: 2030 bytes Desc: not available URL: From oliphant at enthought.com Thu Jun 16 13:05:52 2011 From: oliphant at enthought.com (Travis Oliphant) Date: Thu, 16 Jun 2011 12:05:52 -0500 Subject: [Numpy-discussion] Regression in choose() In-Reply-To: References: Message-ID: <3C246651-E232-4AF4-9838-1F47D9D86A21@enthought.com> Hi Ed, I'm pretty sure that this is "bug" is due to the compile-time constant NPY_MAXARGS defined in include/numpy/ndarraytypes.h I suspect that the versions you are trying it on where it succeeds as a different compile-time constant of that value. NumPy uses a multi-iterator object (PyArrayMultiIterObject defined in ndarraytypes.h as well) to broadcast arguments together for ufuncs and for functions like choose. The data-structure that it uses to do this has a static array of Iterator objects with room for NPY_MAXARGS iterators. I think in some versions this compile time constant has been 40 or higher. Re-compiling NumPy by bumping up that constant will of course require re-compilation of most extensions modules that use the NumPy API. Numeric did not use this approach to broadcast the arguments to choose together and so likely does not have the same limitation. It would also not be that difficult to modify the NumPy code to dynamically allocate the iters array when needed to remove the NPY_MAXARGS limitation. In fact, I would not mind seeing all the NPY_MAXDIMS and NPY_MAXARGS limitations removed. To do it well you would probably want to have some minimum storage-space pre-allocated (say NPY_MAXDIMS as 7 and NPY_MAXARGS as 10 to avoid the memory allocation in common cases) and just increase that space as needed dynamically. This would be a nice project for someone wanting to learn the NumPy code base. -Travis On Jun 16, 2011, at 1:56 AM, Ed Schofield wrote: > Hi all, > > I have been investigation the limitation of the choose() method (and function) to 32 elements. This is a regression in recent versions of NumPy. I have tested choose() in the following NumPy versions: > > 1.0.4: fine > 1.1.1: bug > 1.2.1: fine > 1.3.0: bug > 1.4.x: bug > 1.5.x: bug > 1.6.x: bug > Numeric 24.3: fine > > (To run the tests on versions of NumPy prior to 1.4.x I used Python 2.4.3. For the other tests I used Python 2.7.) > > Here 'bug' means the choose() function has the 32-element limitation. I have been helping an organization to port a large old Numeric-using codebase to NumPy, and the choose() limitation in recent NumPy versions is throwing a spanner in the works. The codebase is currently using both NumPy and Numeric side-by-side, with Numeric only being used for its choose() function, with a few dozen lines like this: > > a = numpy.array(Numeric.choose(b, c)) > > Here is a simple example that triggers the bug. 
It is a simple extension of the example from the choose() docstring: > > ---------------- > > import numpy as np > > choices = [[0, 1, 2, 3], [10, 11, 12, 13], > [20, 21, 22, 23], [30, 31, 32, 33]] > > np.choose([2, 3, 1, 0], choices * 8) > > ---------------- > > A side note: the exception message (defined in core/src/multiarray/iterators.c) is also slightly inconsistent with the actual behaviour: > > Traceback (most recent call last): > File "chooser.py", line 6, in > np.choose([2, 3, 1, 0], choices * 8) > File "/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py", line 277, in choose > return _wrapit(a, 'choose', choices, out=out, mode=mode) > File "/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py", line 37, in _wrapit > result = getattr(asarray(obj),method)(*args, **kwds) > ValueError: Need between 2 and (32) array objects (inclusive). > > The actual behaviour is that choose() passes with 31 objects but fails with 32 objects, so this should read "exclusive" rather than "inclusive". (And why the parentheses around 32?) > > Does anyone know what changed between 1.2.1 and 1.3.0 that introduced the 32-element limitation to choose(), and whether we might be able to lift this limitation again for future NumPy versions? I have a couple of days to work on a patch ... if someone can advise me how to approach this. > > Best wishes, > Ed > > > -- > Dr. Edward Schofield > Python Charmers > +61 (0)405 676 229 > http://pythoncharmers.com > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion --- Travis Oliphant Enthought, Inc. oliphant at enthought.com 1-512-536-1057 http://www.enthought.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From robince at gmail.com Thu Jun 16 13:23:06 2011 From: robince at gmail.com (Robin) Date: Thu, 16 Jun 2011 19:23:06 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA32D5.4050209@noaa.gov> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> Message-ID: On Thu, Jun 16, 2011 at 6:44 PM, Christopher Barker wrote: >> 2. There is also the question of when the process pool is spawned. Though >> I haven't checked, I suspect it happens prior to calling pool.map. But if it >> does not, this is a factor as well, particularly on Windows (less so on >> Linux and Apple). > > It didn't work well on my Mac, so ti's either not an issue, or not > Windows-specific, anyway. I am pretty sure that the process pool is spawned when you create the pool object instance. > >> 3. ?"arrayList" is serialised by pickling, which has a significan >> overhead. ?It's not shared memory either, as the OP's code implies, but the >> main thing is the slowness of cPickle. > > I'll bet this is a big issue, and one I'm curious about how to address, I > have another problem where I need to multi-process, and I'd love to know a > way to pass data to the other process and back *without* going through > pickle. maybe memmapped files? If you are on Linux or Mac then fork works nicely so you have read only shared memory you just have to put it in a module before the fork (so before pool = Pool() ) and then all the subprocesses can access it without any pickling required. 
ie myutil.data = listofdata p = multiprocessing.Pool(8) def mymapfunc(i): return mydatafunc(myutil.data[i]) p.map(mymapfunc, range(len(myutil.data))) Actually that won't work because mymapfunc needs to be in a module so it can be pickled but hopefully you get the idea. Cheers Robin > >> "IPs = N.array(innerProductList)" >> >> 4. ?numpy.array is a very slow function. The benchmark should preferably >> not include this overhead. > > I re-ran, moving that out of the timing loop, and, indeed, it helped a lot, > but it still takes longer with the multi-processing. > > I suspect that the overhead of pickling, etc. is overwhelming the operation > itself. That and the load balancing issue that I don't understand! > > To test this, I did a little experiment -- creating a "fake" operation, one > that simply returns an element from the input array -- so it should take > next to no time, and we can time the overhead of the pickling, etc: > > $ python shared_mem.py > > Using 2 processes > No shared memory, numpy array multiplication took 0.124427080154 seconds > Shared memory, numpy array multiplication took 0.586215019226 seconds > > No shared memory, fake array multiplication took 0.000391006469727 seconds > Shared memory, fake array multiplication took 0.54935503006 seconds > > No shared memory, my array multiplication took 23.5055780411 seconds > Shared memory, my array multiplication took 13.0932741165 seconds > > Bingo! > > The overhead of the multi-processing takes about .54 seconds, which explains > the slowdown for the numpy method > > not so mysterious after all. > > Bruce Southey wrote: > >> But if everything is *single-threaded* and thread-safe, then you just >> create a function and use Anne's very useful handythread.py >> (http://www.scipy.org/Cookbook/Multithreading). > > This may be worth a try -- though the GIL could well get in the way. > >> By the way, if the arrays are sufficiently small, there is a lot of >> overhead involved such that there is more time in communication than >> computation. > > yup -- clearly the case here. I wonder if it's just array size though -- > won't cPickle time scale with array size? So it may not be size pe-se, but > rather how much computation you need for a given size array. > > -Chris > > [I've enclosed the OP's slightly altered code] > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R ? ? ? ? ? ?(206) 526-6959 ? voice > 7600 Sand Point Way NE ? (206) 526-6329 ? fax > Seattle, WA ?98115 ? ? ? (206) 526-6317 ? 
main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From sturla at molden.no Thu Jun 16 13:30:20 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 19:30:20 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA32D5.4050209@noaa.gov> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> Message-ID: <4DFA3DAC.1000704@molden.no> Den 16.06.2011 18:44, skrev Christopher Barker: > > I'll bet this is a big issue, and one I'm curious about how to > address, I have another problem where I need to multi-process, and I'd > love to know a way to pass data to the other process and back > *without* going through pickle. maybe memmapped files? Remember that os.fork is copy-on-write optimized. Often this is enough to share data. To get data back, a possibility is to use shared memory: just memmap from file 0 or -1 (depending on the system) and fork. Note that the use of fork, as opposed to multiprocessing, avoids the pickle overhead for NumPy arrays. On Windows this sucks, because there is no fork system call. Here we are stuck with multiprocessing and pickle, even if we use shared memory. Sturla From Chris.Barker at noaa.gov Thu Jun 16 13:47:02 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 16 Jun 2011 10:47:02 -0700 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA3DAC.1000704@molden.no> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> <4DFA3DAC.1000704@molden.no> Message-ID: <4DFA4196.7070208@noaa.gov> Sturla Molden wrote: > On Windows this sucks, because there is no fork system call. Darn -- I do need to support Windows. > Here we are > stuck with multiprocessing and pickle, even if we use shared memory. What do you need to pickle if you're using shared memory? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From sturla at molden.no Thu Jun 16 13:49:27 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 19:49:27 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> Message-ID: <4DFA4227.2080202@molden.no> Den 16.06.2011 19:23, skrev Robin: > > If you are on Linux or Mac then fork works nicely so you have read > only shared memory you just have to put it in a module before the fork > (so before pool = Pool() ) and then all the subprocesses can access it > without any pickling required. 
ie > myutil.data = listofdata > p = multiprocessing.Pool(8) > def mymapfunc(i): > return mydatafunc(myutil.data[i]) > > p.map(mymapfunc, range(len(myutil.data))) > There is still the issue that p.map does not do any load balancing. The processes in the pool might spend all the time battling for a mutex. We must thus arrange it so each call to mymapfunc processes a chunk of data instead of a single item. This is one strategy that works well: With n remaining work items and m processes, and c the minimum chunk size, let the process holding a mutex grab max( c, n/m ) work items. Then n is reduced accordingly, and this continues until all items are exchausted. Also, as processes do not work on interleaved items, we avoid false sharing as much as possible. (False sharing means that one processor will write to a cache line used by another processor, so both must stop what they're doing and synchronize cache with RAM.) Sturla From sturla at molden.no Thu Jun 16 13:55:43 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 19:55:43 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA4196.7070208@noaa.gov> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> <4DFA3DAC.1000704@molden.no> <4DFA4196.7070208@noaa.gov> Message-ID: <4DFA439F.6050404@molden.no> Den 16.06.2011 19:47, skrev Christopher Barker: > > What do you need to pickle if you're using shared memory? The meta-data for the ndarray (strides, shape, dtype, etc.) and the name of the shared memory segment. As there is no fork, we must tell the other process where to find the shared memory and what to do with it. Sturla From sturla at molden.no Thu Jun 16 14:03:09 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 16 Jun 2011 20:03:09 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA439F.6050404@molden.no> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> <4DFA3DAC.1000704@molden.no> <4DFA4196.7070208@noaa.gov> <4DFA439F.6050404@molden.no> Message-ID: <4DFA455D.8010804@molden.no> Den 16.06.2011 19:55, skrev Sturla Molden: > > The meta-data for the ndarray (strides, shape, dtype, etc.) and the name > of the shared memory segment. As there is no fork, we must tell the > other process where to find the shared memory and what to do with it. > Which by the way is what the shared memory arrays I and Ga?l made will do, but we still have the annoying pickle overhead. 
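To make the chunking idea concrete, a sketch of two ways to coarsen the work: computing the chunk sizes the max(c, n/m) rule hands out, and simply letting pool.map block the items itself through its chunksize argument. All numbers are arbitrary:

import multiprocessing
import numpy as np

def guided_chunks(n, m, c):
    # chunk sizes from repeatedly grabbing max(c, n/m) of the remaining items
    sizes = []
    while n > 0:
        take = min(n, max(c, n // m))
        sizes.append(take)
        n -= take
    return sizes

def inner_product(arrays):
    a1, a2 = arrays
    return np.sum(a1 * a2)

if __name__ == '__main__':
    print(guided_chunks(n=1000, m=4, c=8))
    # big chunks first, tapering down to the minimum chunk size near the end

    # the low-effort alternative: fixed-size blocks handed out by pool.map
    arrayList = [(np.ones((100, 100)), np.ones((100, 100))) for _ in range(400)]
    pool = multiprocessing.Pool(4)
    results = pool.map(inner_product, arrayList, chunksize=25)
    pool.close()
    pool.join()

Note that pool.map's chunksize gives plain static blocking, not the tapered schedule above, but it is often enough to amortize the per-item overhead.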
Sturla From Chris.Barker at noaa.gov Thu Jun 16 14:08:13 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 16 Jun 2011 11:08:13 -0700 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA455D.8010804@molden.no> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> <4DFA3DAC.1000704@molden.no> <4DFA4196.7070208@noaa.gov> <4DFA439F.6050404@molden.no> <4DFA455D.8010804@molden.no> Message-ID: <4DFA468D.4080001@noaa.gov> Sturla Molden wrote: > Den 16.06.2011 19:55, skrev Sturla Molden: >> The meta-data for the ndarray (strides, shape, dtype, etc.) and the name >> of the shared memory segment. As there is no fork, we must tell the >> other process where to find the shared memory and what to do with it. got it, thanks. > Which by the way is what the shared memory arrays I and Ga?l made will do, > but we still have the annoying pickle overhead. Do you have a pointer to that code? It sounds handy. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From bsouthey at gmail.com Thu Jun 16 14:54:20 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 16 Jun 2011 13:54:20 -0500 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: <4DFA32D5.4050209@noaa.gov> References: <4DF8E89D.7080904@molden.no> <4DF9229D.7080002@noaa.gov> <4DF94D69.3080802@molden.no> <4DFA1030.7070403@gmail.com> <4DFA1378.4050804@molden.no> <4DFA187B.7010608@gmail.com> <4DFA32D5.4050209@noaa.gov> Message-ID: <4DFA515C.5050201@gmail.com> On 06/16/2011 11:44 AM, Christopher Barker wrote: > NOTE: I'm only taking part in this discussion because it's interesting > and I hope to learn something. I do hope the OP chimes back in to > clarify his needs, but in the meantime... > > > > Bruce Southey wrote: >> Remember that is what the OP wanted to do, not me. > > Actually, I don't think that's what the OP wanted -- I think we have a > conflict between the need for concrete examples, and the desire to > find a generic solution, so I think this is what the OP wants: > > > How to best multiprocess a _generic_ operation that needs to be > performed on a lot of arrays. Something like: > > > output = [] > for a in a_bunch_of_arrays: > output.append( a_function(a) ) > > > More specifically, a_function() is an inner product, *defined by the > user*. > > So there is no way to optimize the inner product itself (that will be > up to the user), nor any way to generally convert the bunch_of_arrays > to a single array with a single higher-dimensional operation. > > In testing his approach, the OP used a numpy multiply, and a simple, > loop-through-the elements multiply, and found that with his > multiprocessing calls, the simple loop was a fair bit faster with two > processors, but that the numpy one was slower with two processors. Of > course, the looping method was much, much, slower than the numpy one > in any case. > > So Sturla's comments are probably right on: > > Sturla Molden wrote: > >> "innerProductList = pool.map(myutil.numpy_inner_product, arrayList)" >> >> 1. Here we potentially have a case of false sharing and/or mutex >> contention, as the work is too fine grained. 
pool.map does not do >> any load balancing. If pool.map is to scale nicely, each work item >> must take a substantial amount of time. I suspect this is the main >> issue. > >> 2. There is also the question of when the process pool is spawned. >> Though I haven't checked, I suspect it happens prior to calling >> pool.map. But if it does not, this is a factor as well, particularly >> on Windows (less so on Linux and Apple). > > It didn't work well on my Mac, so ti's either not an issue, or not > Windows-specific, anyway. > >> 3. "arrayList" is serialised by pickling, which has a significan >> overhead. It's not shared memory either, as the OP's code implies, >> but the main thing is the slowness of cPickle. > > I'll bet this is a big issue, and one I'm curious about how to > address, I have another problem where I need to multi-process, and I'd > love to know a way to pass data to the other process and back > *without* going through pickle. maybe memmapped files? > >> "IPs = N.array(innerProductList)" >> >> 4. numpy.array is a very slow function. The benchmark should >> preferably not include this overhead. > > I re-ran, moving that out of the timing loop, and, indeed, it helped a > lot, but it still takes longer with the multi-processing. > > I suspect that the overhead of pickling, etc. is overwhelming the > operation itself. That and the load balancing issue that I don't > understand! > > To test this, I did a little experiment -- creating a "fake" > operation, one that simply returns an element from the input array -- > so it should take next to no time, and we can time the overhead of the > pickling, etc: > > $ python shared_mem.py > > Using 2 processes > No shared memory, numpy array multiplication took 0.124427080154 seconds > Shared memory, numpy array multiplication took 0.586215019226 seconds > > No shared memory, fake array multiplication took 0.000391006469727 > seconds > Shared memory, fake array multiplication took 0.54935503006 seconds > > No shared memory, my array multiplication took 23.5055780411 seconds > Shared memory, my array multiplication took 13.0932741165 seconds > > Bingo! > > The overhead of the multi-processing takes about .54 seconds, which > explains the slowdown for the numpy method > > not so mysterious after all. > > Bruce Southey wrote: > >> But if everything is *single-threaded* and thread-safe, then you just >> create a function and use Anne's very useful handythread.py >> (http://www.scipy.org/Cookbook/Multithreading). > > This may be worth a try -- though the GIL could well get in the way. > >> By the way, if the arrays are sufficiently small, there is a lot of >> overhead involved such that there is more time in communication than >> computation. > > yup -- clearly the case here. I wonder if it's just array size though > -- won't cPickle time scale with array size? So it may not be size > pe-se, but rather how much computation you need for a given size array. > > -Chris > > [I've enclosed the OP's slightly altered code] > > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion Please see: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056766.html "I'm doing element wise multiplication, basically innerProduct = numpy.sum(array1*array2) where array1 and array2 are, in general, multidimensional." Thanks for the code as I forgot that it was sent. 
I think there is something weird about these timings(probably because these use time and not timeit) - the shared timings should not be constant across number of processors. For the numpy multiplication approach shows rather constant differences but using np.inner() clearly differs with the number of processors used. So I think that numpy may be using multiple threads here. It is far more evident with large arrays: arraySize = (3000,200) numArrays = 50 Using 1 processes No shared memory, numpy array multiplication took 0.279149055481 seconds Shared memory, numpy array multiplication took 1.87239384651 seconds No shared memory, inner array multiplication took 14.9514381886 seconds Shared memory, inner array multiplication took 17.0087819099 seconds Using 4 processes No shared memory, numpy array multiplication took 0.279071807861 seconds Shared memory, numpy array multiplication took 1.48242783546 seconds No shared memory, inner array multiplication took 15.1401138306 seconds Shared memory, inner array multiplication took 5.2479391098 seconds Using 8 processes No shared memory, numpy array multiplication took 0.281194925308 seconds Shared memory, numpy array multiplication took 1.44942212105 seconds No shared memory, inner array multiplication took 15.3794519901 seconds Shared memory, inner array multiplication took 3.51714301109 seconds Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbelson at princeton.edu Thu Jun 16 15:05:44 2011 From: bbelson at princeton.edu (Brandt Belson) Date: Thu, 16 Jun 2011 15:05:44 -0400 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication Message-ID: Hi all, Thanks for the replies. As mentioned, I'm parallelizing so that I can take many inner products simultaneously (which I agree is embarrassingly parallel). The library I'm writing asks the user to supply a function that takes two objects and returns their inner product. After all the discussion though it seems this is too simplistic of an approach. Instead, I plan to write this part of the library as if the inner product function supplied by the user uses all available cores (with numpy and/or numexpr built with MKL or LAPACK). As far as using fortran or C and openMP, this probably isn't worth the time it would take, both for me and the user. I've tried increasing the array sizes and found the same trends, so the slowdown isn't only because the arrays are too small to see the benefit of multiprocessing. I wrote the code to be easy for anyone to experiment with, so feel free to play around with what is included in the profiling, the sizes of arrays, functions used, etc. I also tried using handythread.foreach with arraySize = (3000,1000), and found the following: No shared memory, numpy array multiplication took 1.57585811615 seconds Shared memory, numpy array multiplication took 1.25499510765 seconds This is definitely an improvement from multiprocessing, but without knowing any better, I was hoping to see a roughly 8x speedup on my 8-core workstation. Based on what Chris sent, it seems there is some large overhead caused by multiprocessing pickling numpy arrays. To test what Robin mentioned > If you are on Linux or Mac then fork works nicely so you have read > only shared memory you just have to put it in a module before the fork > (so before pool = Pool() ) and then all the subprocesses can access it > without any pickling required. 
ie > myutil.data = listofdata > p = multiprocessing.Pool(8) > def mymapfunc(i): > return mydatafunc(myutil.data[i]) > > p.map(mymapfunc, range(len(myutil.data))) I tried creating the arrayList in the myutil module and using multiprocessing to find the inner products of myutil.arrayList, however this was still slower than not using multiprocessing, so I believe there is still some large overhead. Here are the results: No shared memory, numpy array multiplication took 1.55906510353 seconds Shared memory, numpy array multiplication took 9.82426381111 seconds Shared memory, myutil.arrayList numpy array multiplication took 8.77094507217 seconds I'm attaching this code. I'm going to work around this numpy/multiprocessing behavior with numpy/numexpr built with MKL or LAPACK. It would be good to know exactly what's causing this though. It would be nice if there was a way to get the ideal speedup via multiprocessing, regardless of the internal workings of the single-threaded inner product function, as this was the behavior I expected. I imagine other people might come across similar situations, but again I'm going to try to get around this by letting MKL or LAPACK make use of all available cores. Thanks again, Brandt -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: myutil.py Type: application/octet-stream Size: 735 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: shared_mem.py Type: application/octet-stream Size: 2773 bytes Desc: not available URL: From robince at gmail.com Thu Jun 16 15:19:02 2011 From: robince at gmail.com (Robin) Date: Thu, 16 Jun 2011 21:19:02 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: The fact that you are still passing the myutil.arrayList argument to pool.map means the data is still being pickled. All arguments to pool.map are pickled to be passed to the subprocess. You need to change the function so that it accesses the data directly, then just pass indices to pool.map. Cheers Robin On Thu, Jun 16, 2011 at 9:05 PM, Brandt Belson wrote: > Hi all, > Thanks for the replies. As mentioned, I'm parallelizing so that I can take > many inner products simultaneously (which I agree is embarrassingly > parallel). The library I'm writing asks the user to supply a function that > takes two objects and returns their inner product. After all the discussion > though it seems this is too simplistic of an approach. Instead, I plan to > write this part of the library as if the inner product function supplied by > the user uses all available cores (with numpy and/or numexpr built with MKL > or LAPACK). > As far as using fortran or C and openMP, this probably isn't worth the time > it would take, both for me and the user. > I've tried increasing the array sizes and found the same trends, so the > slowdown isn't only because the arrays are too small to see the benefit of > multiprocessing. I wrote the code to be easy for anyone to experiment with, > so feel free to play around with what is included in the profiling, the > sizes of arrays, functions used, etc. 
> I also tried using handythread.foreach with arraySize = (3000,1000), and > found the following: > No shared memory, numpy array multiplication took 1.57585811615 seconds > Shared memory, numpy array multiplication took 1.25499510765 seconds > This is definitely an improvement from multiprocessing, but without knowing > any better, I was hoping to see a roughly 8x speedup on my 8-core > workstation. > Based on what Chris sent, it seems there is some large overhead caused by > multiprocessing pickling numpy arrays. To test what Robin mentioned >> If you are on Linux or Mac then fork works nicely so you have read >> only shared memory you just have to put it in a module before the fork >> (so before pool = Pool() ) and then all the subprocesses can access it >> without any pickling required. ie >> myutil.data = listofdata >> p = multiprocessing.Pool(8) >> def mymapfunc(i): >> ? return mydatafunc(myutil.data[i]) >> >> p.map(mymapfunc, range(len(myutil.data))) > I tried creating the arrayList in the myutil module and using > multiprocessing to find the inner products of myutil.arrayList, however this > was still slower than not using multiprocessing, so I believe there is still > some large overhead. Here are the results: > No shared memory, numpy array multiplication took 1.55906510353 seconds > Shared memory, numpy array multiplication took 9.82426381111 seconds > Shared memory, myutil.arrayList numpy array multiplication took > 8.77094507217 seconds > I'm attaching this code. > I'm going to work around this numpy/multiprocessing behavior with > numpy/numexpr built with MKL or LAPACK. It would be good to know exactly > what's causing this though. It would be nice if there was a way to get the > ideal speedup via multiprocessing, regardless of the internal workings of > the single-threaded inner product function, as this was the behavior I > expected. I imagine other people might come across similar situations, but > again I'm going to try to get around this by letting MKL or LAPACK make use > of all available cores. > Thanks again, > Brandt > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From mwwiebe at gmail.com Thu Jun 16 16:00:39 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 16 Jun 2011 15:00:39 -0500 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: References: Message-ID: I've received some good feedback from Chuck and Ralf on the code and documentation, respectively, and build testing with proposed fixes for issues from the previous merge from Derek. I believe the current set of changes are in good shape to merge, so would like to proceed with that later today. Cheers, Mark On Tue, Jun 14, 2011 at 6:34 PM, Mark Wiebe wrote: > These functions are now fully implemented and documented. As always, code > reviews are welcome here: > > https://github.com/numpy/numpy/pull/87 > > and for those that don't want to dig into review C code, the commit for the > documentation is here: > > > https://github.com/m-paradox/numpy/commit/6b5a42a777b16812e774193b06da1b68b92bc689 > > This is probably also another good place to do a merge to master, so if > people could test it on Mac/Windows/other platforms that would be much > appreciated. 
> > Thanks, > Mark > > On Fri, Jun 10, 2011 at 5:49 PM, Mark Wiebe wrote: > >> I've implemented the busday_offset function with support for the weekmask >> and roll parameters, the commits are tagged 'datetime-bday' in the pull >> request here: >> >> https://github.com/numpy/numpy/pull/87 >> >> -Mark >> >> >> On Thu, Jun 9, 2011 at 5:23 PM, Mark Wiebe wrote: >> >>> Here's a possible design for a business day API for numpy datetimes: >>> >>> >>> The 'B' business day unit will be removed. All business day-related >>> calculations will be done using the 'D' day unit. >>> >>> A class *BusinessDayDef* to encapsulate the definition of the business >>> week and holidays. The business day functions will either take one of these >>> objects, or separate weekmask and holidays parameters, to specify the >>> business day definition. This class serves as both a performance >>> optimization and a way to encapsulate the weekmask and holidays together, >>> for example if you want to make a dictionary mapping exchange names to their >>> trading days definition. >>> >>> The weekmask can be specified in a number of ways, and internally becomes >>> a boolean array with 7 elements with True for the days Monday through Sunday >>> which are valid business days. Some different notations are for the 5-day >>> week include [1,1,1,1,1,0,0], "1111100" "MonTueWedThuFri". The holidays are >>> always specified as a one-dimensional array of dtype 'M8[D]', and are >>> internally used in sorted form. >>> >>> >>> A function *is_busday*(datearray, weekmask=, holidays=, busdaydef=) >>> returns a boolean array matching the input datearray, with True for the >>> valid business days. >>> >>> A function *busday_offset*(datearray, offsetarray, >>> roll='raise', weekmask=, holidays=, busdaydef=) which first applies the >>> 'roll' policy to start at a valid business date, then offsets the date by >>> the number of business days specified in offsetarray. The arrays datearray >>> and offsetarray are broadcast together. The 'roll' parameter can be >>> 'forward'/'following', 'backward'/'preceding', 'modifiedfollowing', >>> 'modifiedpreceding', or 'raise' (the default). >>> >>> A function *busday_count*(datearray1, datearray2, weekmask=, holidays=, >>> busdaydef=) which calculates the number of business days between datearray1 >>> and datearray2, not including the day of datearray2. >>> >>> >>> For example, to find the first Monday in Feb 2011, >>> >>> >>>np.busday_offset('2011-02', 0, roll='forward', weekmask='Mon') >>> >>> or to find the number of weekdays in Feb 2011, >>> >>> >>>np.busday_count('2011-02', '2011-03') >>> >>> This set of three functions appears to be powerful enough to express the >>> business-day computations that I've been shown thus far. >>> >>> Cheers, >>> Mark >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 16 16:11:28 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 16 Jun 2011 14:11:28 -0600 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: References: Message-ID: On Thu, Jun 16, 2011 at 2:00 PM, Mark Wiebe wrote: > I've received some good feedback from Chuck and Ralf on the code and > documentation, respectively, and build testing with proposed fixes for > issues from the previous merge from Derek. I believe the current set of > changes are in good shape to merge, so would like to proceed with that later > today. > > Go for it. 
Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsouthey at gmail.com Thu Jun 16 17:31:16 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 16 Jun 2011 16:31:16 -0500 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: <4DFA7624.1040102@gmail.com> On 06/16/2011 02:05 PM, Brandt Belson wrote: > Hi all, > Thanks for the replies. As mentioned, I'm parallelizing so that I can > take many inner products simultaneously (which I agree is > embarrassingly parallel). The library I'm writing asks the user to > supply a function that takes two objects and returns their inner > product. After all the discussion though it seems this is too > simplistic of an approach. Instead, I plan to write this part of the > library as if the inner product function supplied by the user uses all > available cores (with numpy and/or numexpr built with MKL or LAPACK). > > As far as using fortran or C and openMP, this probably isn't worth the > time it would take, both for me and the user. > > I've tried increasing the array sizes and found the same trends, so > the slowdown isn't only because the arrays are too small to see the > benefit of multiprocessing. I wrote the code to be easy for anyone to > experiment with, so feel free to play around with what is included in > the profiling, the sizes of arrays, functions used, etc. > > I also tried using handythread.foreach with arraySize = (3000,1000), > and found the following: > No shared memory, numpy array multiplication took 1.57585811615 seconds > Shared memory, numpy array multiplication took 1.25499510765 seconds > This is definitely an improvement from multiprocessing, but without > knowing any better, I was hoping to see a roughly 8x speedup on my > 8-core workstation. > > Based on what Chris sent, it seems there is some large overhead caused > by multiprocessing pickling numpy arrays. To test what Robin mentioned > > > If you are on Linux or Mac then fork works nicely so you have read > > only shared memory you just have to put it in a module before the fork > > (so before pool = Pool() ) and then all the subprocesses can access it > > without any pickling required. ie > > myutil.data = listofdata > > p = multiprocessing.Pool(8) > > def mymapfunc(i): > > return mydatafunc(myutil.data[i]) > > > > p.map(mymapfunc, range(len(myutil.data))) > > I tried creating the arrayList in the myutil module and using > multiprocessing to find the inner products of myutil.arrayList, > however this was still slower than not using multiprocessing, so I > believe there is still some large overhead. Here are the results: > No shared memory, numpy array multiplication took 1.55906510353 seconds > Shared memory, numpy array multiplication took 9.82426381111 seconds > Shared memory, myutil.arrayList numpy array multiplication took > 8.77094507217 seconds > I'm attaching this code. > > I'm going to work around this numpy/multiprocessing behavior with > numpy/numexpr built with MKL or LAPACK. It would be good to know > exactly what's causing this though. It would be nice if there was a > way to get the ideal speedup via multiprocessing, regardless of the > internal workings of the single-threaded inner product function, as > this was the behavior I expected. I imagine other people might come > across similar situations, but again I'm going to try to get around > this by letting MKL or LAPACK make use of all available cores. 
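A minimal, self-contained sketch of the fork-based sharing pattern Robin suggests (Linux/Mac only; the module name myutil follows the thread, and the plain element-wise product below just stands in for the user-supplied inner product function):

import multiprocessing
import numpy as np
import myutil    # helper module from the thread; any importable module works

# Attach the data to the module *before* the Pool is created, so the
# fork()ed workers inherit it read-only and nothing large is pickled.
myutil.arrayList = [np.random.random((3000, 1000)) for _ in range(8)]

def inner_product(pair):
    # only this small tuple of indices travels through pool.map
    i, j = pair
    return np.sum(myutil.arrayList[i] * myutil.arrayList[j])

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=8)   # fork happens here, after the data exists
    pairs = [(i, j) for i in range(8) for j in range(i, 8)]
    results = pool.map(inner_product, pairs)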
> > Thanks again, > Brandt > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion I think this is not being benchmarked correctly because there should be a noticeable different when different number of threads are selected. But really you should read these sources: http://www.scipy.org/ParallelProgramming http://stackoverflow.com/questions/5260068/multithreaded-blas-in-python-numpy Also numpy has extra things going on like checks and copies that probably make using np.inner() slower. Thus, your 'numpy_inner_product' is probably as efficient as you can get without extreme measures like cython. Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Thu Jun 16 18:52:19 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Fri, 17 Jun 2011 00:52:19 +0200 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> Message-ID: <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> Hi Mark, On 16.06.2011, at 5:40PM, Mark Wiebe wrote: >> >>> np.datetime64('2011-06-16 02:03:04Z', 'D') >> np.datetime64('0000-06-16','D') >> >> I've tried to track this down in datetime.c, but unsuccessfully so (i.e. I could not connect it >> to any of the dts->year assignments therein). >> > This is definitely perplexing. Probably the first thing to check is whether it's breaking during parsing or printing. This should always produce the same result: > > >>> np.datetime64('1970-03-23 20:00:00Z').astype('i8') > 7070400 > > But maybe the test_days_creation is already checking that thoroughly enough. > > Then, maybe printf-ing the year value at various stages of the printing, like in set_datetimestruct_days, after convert_datetime_to_datetimestruct, and in make_iso_8601_date. This would at least isolate where the year is getting lost. > ok, that was a lengthy hunt, but it's in printing the string in make_iso_8601_date: tmplen = snprintf(substr, sublen, "%04" NPY_INT64_FMT, dts->year); fprintf(stderr, "printed %d[%d]: dts->year=%lld: %s\n", tmplen, sublen, dts->year, substr); produces >>> np.datetime64('1970-03-23 20:00:00Z', 'D') printed 4[62]: dts->year=1970: 0000 numpy.datetime64('0000-03-23','D') It seems snprintf is not using the correct format for INT64 (as I happened to do in fprintf before realising I had to use "%lld" ;-) - could it be this is a general issue, which just does not show up on little-endian machines because they happen to pass the right half of the int64 to printf? BTW, how is this supposed to be handled (in 4 digits) if the year is indeed beyond the 32bit range (i.e. >~ 0.3 Hubble times...)? Just wondering if one could simply cast it to int32 before print. 
Cheers, Derek From mwwiebe at gmail.com Thu Jun 16 20:02:41 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 16 Jun 2011 19:02:41 -0500 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> Message-ID: On Thu, Jun 16, 2011 at 5:52 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > Hi Mark, > > On 16.06.2011, at 5:40PM, Mark Wiebe wrote: > > >> >>> np.datetime64('2011-06-16 02:03:04Z', 'D') > >> np.datetime64('0000-06-16','D') > >> > >> I've tried to track this down in datetime.c, but unsuccessfully so (i.e. > I could not connect it > >> to any of the dts->year assignments therein). > >> > > This is definitely perplexing. Probably the first thing to check is > whether it's breaking during parsing or printing. This should always produce > the same result: > > > > >>> np.datetime64('1970-03-23 20:00:00Z').astype('i8') > > 7070400 > > > > But maybe the test_days_creation is already checking that thoroughly > enough. > > > > Then, maybe printf-ing the year value at various stages of the printing, > like in set_datetimestruct_days, after convert_datetime_to_datetimestruct, > and in make_iso_8601_date. This would at least isolate where the year is > getting lost. > > > ok, that was a lengthy hunt, but it's in printing the string in > make_iso_8601_date: > > tmplen = snprintf(substr, sublen, "%04" NPY_INT64_FMT, dts->year); > fprintf(stderr, "printed %d[%d]: dts->year=%lld: %s\n", tmplen, sublen, > dts->year, substr); > > produces > > >>> np.datetime64('1970-03-23 20:00:00Z', 'D') > printed 4[62]: dts->year=1970: 0000 > numpy.datetime64('0000-03-23','D') > > It seems snprintf is not using the correct format for INT64 (as I happened > to do in fprintf before > realising I had to use "%lld" ;-) - could it be this is a general issue, > which just does not show up > on little-endian machines because they happen to pass the right half of the > int64 to printf? > BTW, how is this supposed to be handled (in 4 digits) if the year is indeed > beyond the 32bit range > (i.e. >~ 0.3 Hubble times...)? Just wondering if one could simply cast it > to int32 before print. > I'd prefer to fix the NPY_INT64_FMT macro. There's no point in having it if it doesn't work... What is NumPy setting it to for that platform? -Mark > Cheers, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From derek at astro.physik.uni-goettingen.de Thu Jun 16 21:18:51 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Fri, 17 Jun 2011 03:18:51 +0200 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> Message-ID: <91FE4549-74FE-421A-A1B7-702B63C500E9@astro.physik.uni-goettingen.de> On 17.06.2011, at 2:02AM, Mark Wiebe wrote: >> ok, that was a lengthy hunt, but it's in printing the string in make_iso_8601_date: >> >> tmplen = snprintf(substr, sublen, "%04" NPY_INT64_FMT, dts->year); >> fprintf(stderr, "printed %d[%d]: dts->year=%lld: %s\n", tmplen, sublen, dts->year, substr); >> >> produces >> >> >>> np.datetime64('1970-03-23 20:00:00Z', 'D') >> printed 4[62]: dts->year=1970: 0000 >> numpy.datetime64('0000-03-23','D') >> >> It seems snprintf is not using the correct format for INT64 (as I happened to do in fprintf before >> realising I had to use "%lld" ;-) - could it be this is a general issue, which just does not show up >> on little-endian machines because they happen to pass the right half of the int64 to printf? >> BTW, how is this supposed to be handled (in 4 digits) if the year is indeed beyond the 32bit range >> (i.e. >~ 0.3 Hubble times...)? Just wondering if one could simply cast it to int32 before print. >> > I'd prefer to fix the NPY_INT64_FMT macro. There's no point in having it if it doesn't work... What is NumPy setting it to for that platform? > Of course (just felt somewhat lost among all the #defines). It clearly seems to be mis-constructed on PowerPC 32: NPY_SIZEOF_LONG is 4, thus NPY_INT64_FMT is set to NPY_LONGLONG_FMT - "Ld", but this does not seem to handle int64 on big-endian Macs - explicitly printing "%Ld", dts->year also produces 0. Changing the snprintf format to "%04" "lld" produces the correct output, so if nothing else avails, I suggest to put something like # elseif (defined(__ppc__) || defined(__ppc64__)) #define LONGLONG_FMT "lld" #define ULONGLONG_FMT "llu" # else into npy_common.h (or possibly simply "defined(__APPLE__)", since %lld seems to work on 32bit i386 Macs just as well). Cheers, Derek From charlesr.harris at gmail.com Thu Jun 16 22:50:58 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 16 Jun 2011 20:50:58 -0600 Subject: [Numpy-discussion] Sorting library. Message-ID: Hi All, Pull request 89 is a set of patches that move the numpy sorting patches into a library. I've tested it on fedora 15 with gcc and would greatly appreciate it if some of you with different os/arch/compilers could test it out to make sure it works correctly. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 16 22:54:18 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 16 Jun 2011 20:54:18 -0600 Subject: [Numpy-discussion] Unicode characters in a numpy array In-Reply-To: References: Message-ID: On Wed, Jun 15, 2011 at 1:30 PM, G?khan Sever wrote: > Hello, > > The following snippet works fine for a regular string and prints out > the string without a problem. > > python > Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) > [GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> mystr = u"??????" 
> >>> mystr > u'\xf6\xf6\xf6\u011f\u011f\u011f' > >>> type(mystr) > > >>> print mystr > ?????? > > What is the correct way to print out the following array? > > >>> import numpy as np > >>> arr = np.array(u"??????") > >>> arr > array(u'\xf6\xf6\xf6\u011f\u011f\u011f', > dtype=' >>> print arr > Traceback (most recent call last): > File "", line 1, in > File "/usr/lib64/python2.7/site-packages/numpy/core/numeric.py", > line 1379, in array_str > return array2string(a, max_line_width, precision, suppress_small, > ' ', "", str) > File "/usr/lib64/python2.7/site-packages/numpy/core/arrayprint.py", > line 426, in array2string > lst = style(x) > UnicodeEncodeError: 'ascii' codec can't encode characters in position > 0-5: ordinal not in range(128) > > I don't know. It might be that we need to fix the printing functions for unicode and maybe have some way to set the codec as well. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From gruben at bigpond.net.au Fri Jun 17 00:36:15 2011 From: gruben at bigpond.net.au (gary ruben) Date: Fri, 17 Jun 2011 14:36:15 +1000 Subject: [Numpy-discussion] genfromtxt converter question Message-ID: I'm trying to read a file containing data formatted as in the following example using genfromtxt and I'm doing something wrong. It almost works. Can someone point out my error, or suggest a simpler solution to the ugly converter function? I thought I'd leave in the commented-out line for future reference, which I thought was a neat way to get genfromtxt to show what it is trying to pass to the converter. import numpy as np from StringIO import StringIO a = StringIO('''\ (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) ''') #~ b = np.genfromtxt(a,converters={1:lambda x:str(x)},dtype=object,delimiter=18) b = np.genfromtxt(a,converters={1:lambda x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) print b -- This produces [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' (-4.6027,-0.1426)', ' (-1.4249, 1.7330)') (' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' (-0.4805,-1.1976)') (' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' (-1.3467, 1.6579)') (' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( 2.5619,-3.3708)')] which I just need to unpack into a 4x4 array, but I get an error if I try to apply a different view. thanks, Gary From ckkart at hoc.net Fri Jun 17 04:16:44 2011 From: ckkart at hoc.net (Christian K.) Date: Fri, 17 Jun 2011 08:16:44 +0000 (UTC) Subject: [Numpy-discussion] 3d ODR References: Message-ID: Robert Kern gmail.com> writes: > > On Thu, Jun 16, 2011 at 06:28, Christian K. hoc.net> wrote: > > Hi, > > > > I need to do fit a 3d surface to a point cloud. This sounds like a job for 3d > > orthogonal distance regression. Does anybody know of an implementation? > > As eat points out, scipy.odr should do the job nicely. You should read > at least part of the ODRPACK User's Guide in addition to the scipy > docs for it: Thank you both! I must admit that I was ignorant about the fact that scipy.odr can also handle multidimensional data. 
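For the archive, a small sketch of what such a multidimensional scipy.odr fit can look like (the paraboloid model, the synthetic point cloud and the starting values are made up purely for illustration; see the ODRPACK User's Guide for implicit models):

import numpy as np
from scipy import odr

# explicit surface z = f(x, y; beta)
def surface(beta, xy):
    x, y = xy
    return beta[0] + beta[1] * x**2 + beta[2] * y**2

# synthetic point cloud with noise
x, y = np.random.uniform(-1.0, 1.0, size=(2, 200))
z = 0.5 + 1.2 * x**2 - 0.8 * y**2 + np.random.normal(scale=0.05, size=200)

model = odr.Model(surface)
data = odr.Data(np.row_stack([x, y]), z)      # 2-D explanatory variable, 1-D response
output = odr.ODR(data, model, beta0=[1.0, 1.0, -1.0]).run()
print output.beta                             # fitted [offset, x^2 coeff, y^2 coeff]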
Christian From shish at keba.be Fri Jun 17 06:40:51 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 17 Jun 2011 06:40:51 -0400 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: Message-ID: If I understand correctly, your error is that you convert only the second column, because your converters dictionary contains a single key (1). If you have it contain keys from 0 to 3 associated to the same function, it should work. -=- Olivier 2011/6/17 gary ruben > I'm trying to read a file containing data formatted as in the > following example using genfromtxt and I'm doing something wrong. It > almost works. Can someone point out my error, or suggest a simpler > solution to the ugly converter function? I thought I'd leave in the > commented-out line for future reference, which I thought was a neat > way to get genfromtxt to show what it is trying to pass to the > converter. > > import numpy as np > from StringIO import StringIO > > a = StringIO('''\ > (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) > (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) > ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) > ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) > ''') > > #~ b = np.genfromtxt(a,converters={1:lambda > x:str(x)},dtype=object,delimiter=18) > b = np.genfromtxt(a,converters={1:lambda > x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) > > print b > > -- > > This produces > [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' (-4.6027,-0.1426)', ' > (-1.4249, 1.7330)') > (' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' > (-0.4805,-1.1976)') > (' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' (-1.3467, > 1.6579)') > (' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( 2.5619,-3.3708)')] > > which I just need to unpack into a 4x4 array, but I get an error if I > try to apply a different view. > > thanks, > Gary > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robince at gmail.com Fri Jun 17 08:08:36 2011 From: robince at gmail.com (Robin) Date: Fri, 17 Jun 2011 14:08:36 +0200 Subject: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication In-Reply-To: References: Message-ID: I didn't have time yesterday but the attached illustrates what I mean about putting the shared data in a module (it should work with the previous myutil). I don't get a big speed up but at least it is faster using multiple subprocesses: Not threaded: 0.450406074524 Using 8 processes: 0.282383 If I add a 1s sleep in the inner product function shows a much more significant improvement (ie if it were a longer computation): Not threaded: 50.6744170189 Using 8 processes: 8.152393 Still not quite linear but certainly an improvement. Cheers Robin On Thu, Jun 16, 2011 at 9:19 PM, Robin wrote: > The fact that you are still passing the myutil.arrayList argument to > pool.map means the data is still being pickled. All arguments to > pool.map are pickled to be passed to the subprocess. You need to > change the function so that it accesses the data directly, then just > pass indices to pool.map. > > Cheers > > Robin > > On Thu, Jun 16, 2011 at 9:05 PM, Brandt Belson wrote: >> Hi all, >> Thanks for the replies. 
As mentioned, I'm parallelizing so that I can take >> many inner products simultaneously (which I agree is embarrassingly >> parallel). The library I'm writing asks the user to supply a function that >> takes two objects and returns their inner product. After all the discussion >> though it seems this is too simplistic of an approach. Instead, I plan to >> write this part of the library as if the inner product function supplied by >> the user uses all available cores (with numpy and/or numexpr built with MKL >> or LAPACK). >> As far as using fortran or C and openMP, this probably isn't worth the time >> it would take, both for me and the user. >> I've tried increasing the array sizes and found the same trends, so the >> slowdown isn't only because the arrays are too small to see the benefit of >> multiprocessing. I wrote the code to be easy for anyone to experiment with, >> so feel free to play around with what is included in the profiling, the >> sizes of arrays, functions used, etc. >> I also tried using handythread.foreach with arraySize = (3000,1000), and >> found the following: >> No shared memory, numpy array multiplication took 1.57585811615 seconds >> Shared memory, numpy array multiplication took 1.25499510765 seconds >> This is definitely an improvement from multiprocessing, but without knowing >> any better, I was hoping to see a roughly 8x speedup on my 8-core >> workstation. >> Based on what Chris sent, it seems there is some large overhead caused by >> multiprocessing pickling numpy arrays. To test what Robin mentioned >>> If you are on Linux or Mac then fork works nicely so you have read >>> only shared memory you just have to put it in a module before the fork >>> (so before pool = Pool() ) and then all the subprocesses can access it >>> without any pickling required. ie >>> myutil.data = listofdata >>> p = multiprocessing.Pool(8) >>> def mymapfunc(i): >>> ? return mydatafunc(myutil.data[i]) >>> >>> p.map(mymapfunc, range(len(myutil.data))) >> I tried creating the arrayList in the myutil module and using >> multiprocessing to find the inner products of myutil.arrayList, however this >> was still slower than not using multiprocessing, so I believe there is still >> some large overhead. Here are the results: >> No shared memory, numpy array multiplication took 1.55906510353 seconds >> Shared memory, numpy array multiplication took 9.82426381111 seconds >> Shared memory, myutil.arrayList numpy array multiplication took >> 8.77094507217 seconds >> I'm attaching this code. >> I'm going to work around this numpy/multiprocessing behavior with >> numpy/numexpr built with MKL or LAPACK. It would be good to know exactly >> what's causing this though. It would be nice if there was a way to get the >> ideal speedup via multiprocessing, regardless of the internal workings of >> the single-threaded inner product function, as this was the behavior I >> expected. I imagine other people might come across similar situations, but >> again I'm going to try to get around this by letting MKL or LAPACK make use >> of all available cores. >> Thanks again, >> Brandt >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: poolmap.py Type: text/x-python Size: 1076 bytes Desc: not available URL: From gruben at bigpond.net.au Fri Jun 17 09:22:40 2011 From: gruben at bigpond.net.au (gary ruben) Date: Fri, 17 Jun 2011 23:22:40 +1000 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: Message-ID: Thanks Olivier, Your suggestion gets me a little closer to what I want, but doesn't quite work. Replacing the conversion with c = lambda x:np.cast[np.complex64](complex(*eval(x))) b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, 3:c},dtype=None,delimiter=18,usecols=range(4)) produces [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] [(-5.4797000885+0j) (1.85850000381-1.5501999855j) (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) (-1.34669995308+1.65789997578j)] [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] which is not yet an array of complex numbers. It seems close to the solution though. Gary On Fri, Jun 17, 2011 at 8:40 PM, Olivier Delalleau wrote: > If I understand correctly, your error is that you convert only the second > column, because your converters dictionary contains a single key (1). > If you have it contain keys from 0 to 3 associated to the same function, it > should work. > > -=- Olivier > > 2011/6/17 gary ruben >> >> I'm trying to read a file containing data formatted as in the >> following example using genfromtxt and I'm doing something wrong. It >> almost works. Can someone point out my error, or suggest a simpler >> solution to the ugly converter function? I thought I'd leave in the >> commented-out line for future reference, which I thought was a neat >> way to get genfromtxt to show what it is trying to pass to the >> converter. >> >> import numpy as np >> from StringIO import StringIO >> >> a = StringIO('''\ >> ?(-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) >> ?(-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) >> ?( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) >> ?( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) >> ''') >> >> #~ b = np.genfromtxt(a,converters={1:lambda >> x:str(x)},dtype=object,delimiter=18) >> b = np.genfromtxt(a,converters={1:lambda >> x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) >> >> print b >> >> -- >> >> This produces >> [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' (-4.6027,-0.1426)', ' >> (-1.4249, 1.7330)') >> ?(' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' >> (-0.4805,-1.1976)') >> ?(' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' (-1.3467, >> 1.6579)') >> ?(' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( 2.5619,-3.3708)')] >> >> which I just need to unpack into a 4x4 array, but I get an error if I >> try to apply a different view. 
>> >> thanks, >> Gary >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From bsouthey at gmail.com Fri Jun 17 09:37:00 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 17 Jun 2011 08:37:00 -0500 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: Message-ID: <4DFB587C.60603@gmail.com> On 06/17/2011 08:22 AM, gary ruben wrote: > Thanks Olivier, > Your suggestion gets me a little closer to what I want, but doesn't > quite work. Replacing the conversion with > > c = lambda x:np.cast[np.complex64](complex(*eval(x))) > b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, > 3:c},dtype=None,delimiter=18,usecols=range(4)) > > produces > > [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) > (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] > [(-5.4797000885+0j) (1.85850000381-1.5501999855j) > (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] > [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) > (-1.34669995308+1.65789997578j)] > [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] > > which is not yet an array of complex numbers. It seems close to the > solution though. > > Gary > > On Fri, Jun 17, 2011 at 8:40 PM, Olivier Delalleau wrote: >> If I understand correctly, your error is that you convert only the second >> column, because your converters dictionary contains a single key (1). >> If you have it contain keys from 0 to 3 associated to the same function, it >> should work. >> >> -=- Olivier >> >> 2011/6/17 gary ruben >>> I'm trying to read a file containing data formatted as in the >>> following example using genfromtxt and I'm doing something wrong. It >>> almost works. Can someone point out my error, or suggest a simpler >>> solution to the ugly converter function? I thought I'd leave in the >>> commented-out line for future reference, which I thought was a neat >>> way to get genfromtxt to show what it is trying to pass to the >>> converter. >>> >>> import numpy as np >>> from StringIO import StringIO >>> >>> a = StringIO('''\ >>> (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) >>> (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) >>> ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) >>> ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) >>> ''') >>> >>> #~ b = np.genfromtxt(a,converters={1:lambda >>> x:str(x)},dtype=object,delimiter=18) >>> b = np.genfromtxt(a,converters={1:lambda >>> x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) >>> >>> print b >>> >>> -- >>> >>> This produces >>> [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' (-4.6027,-0.1426)', ' >>> (-1.4249, 1.7330)') >>> (' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' >>> (-0.4805,-1.1976)') >>> (' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' (-1.3467, >>> 1.6579)') >>> (' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( 2.5619,-3.3708)')] >>> >>> which I just need to unpack into a 4x4 array, but I get an error if I >>> try to apply a different view. 
>>> >>> thanks, >>> Gary >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion Just an observation for the StringIO object, you have multiple spaces within the parentheses but, by default, you are using whitespace delimiters in genfromtxt. So, yes, genfromtxt is going have issues. If you can rewrite the input, then you need a non-space and non-comma delimiter then specify that delimiter to genfromtxt. Otherwise you are probably going to have to write you own parser - for each line, split on ' (' etc. Bruce From shish at keba.be Fri Jun 17 09:51:02 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 17 Jun 2011 09:51:02 -0400 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: <4DFB587C.60603@gmail.com> References: <4DFB587C.60603@gmail.com> Message-ID: 2011/6/17 Bruce Southey > On 06/17/2011 08:22 AM, gary ruben wrote: > > Thanks Olivier, > > Your suggestion gets me a little closer to what I want, but doesn't > > quite work. Replacing the conversion with > > > > c = lambda x:np.cast[np.complex64](complex(*eval(x))) > > b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, > > 3:c},dtype=None,delimiter=18,usecols=range(4)) > > > > produces > > > > [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) > > (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] > > [(-5.4797000885+0j) (1.85850000381-1.5501999855j) > > (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] > > [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) > > (-1.34669995308+1.65789997578j)] > > [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] > > > > which is not yet an array of complex numbers. It seems close to the > > solution though. > > > > Gary > > > > On Fri, Jun 17, 2011 at 8:40 PM, Olivier Delalleau > wrote: > >> If I understand correctly, your error is that you convert only the > second > >> column, because your converters dictionary contains a single key (1). > >> If you have it contain keys from 0 to 3 associated to the same function, > it > >> should work. > >> > >> -=- Olivier > >> > >> 2011/6/17 gary ruben > >>> I'm trying to read a file containing data formatted as in the > >>> following example using genfromtxt and I'm doing something wrong. It > >>> almost works. Can someone point out my error, or suggest a simpler > >>> solution to the ugly converter function? I thought I'd leave in the > >>> commented-out line for future reference, which I thought was a neat > >>> way to get genfromtxt to show what it is trying to pass to the > >>> converter. 
> >>> > >>> import numpy as np > >>> from StringIO import StringIO > >>> > >>> a = StringIO('''\ > >>> (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, > 1.7330) > >>> (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) > (-0.4805,-1.1976) > >>> ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, > 1.6579) > >>> ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( > 2.5619,-3.3708) > >>> ''') > >>> > >>> #~ b = np.genfromtxt(a,converters={1:lambda > >>> x:str(x)},dtype=object,delimiter=18) > >>> b = np.genfromtxt(a,converters={1:lambda > >>> x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) > >>> > >>> print b > >>> > >>> -- > >>> > >>> This produces > >>> [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' (-4.6027,-0.1426)', ' > >>> (-1.4249, 1.7330)') > >>> (' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' > >>> (-0.4805,-1.1976)') > >>> (' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' (-1.3467, > >>> 1.6579)') > >>> (' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( > 2.5619,-3.3708)')] > >>> > >>> which I just need to unpack into a 4x4 array, but I get an error if I > >>> try to apply a different view. > >>> > >>> thanks, > >>> Gary > >>> _______________________________________________ > >>> NumPy-Discussion mailing list > >>> NumPy-Discussion at scipy.org > >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > Just an observation for the StringIO object, you have multiple spaces > within the parentheses but, by default, you are using whitespace > delimiters in genfromtxt. So, yes, genfromtxt is going have issues. > > If you can rewrite the input, then you need a non-space and non-comma > delimiter then specify that delimiter to genfromtxt. Otherwise you are > probably going to have to write you own parser - for each line, split on > ' (' etc. > > Bruce It's funny though because that part (the parsing) actually seems to work. However I've been playing a bit with his example and indeed I can't get numpy to return a complex array. It keeps resulting in an "object" dtype. The only way I found was to convert it temporarily into a list to recast it in complex64, adding the line: b = np.array(map(list, b), dtype=np.complex64) -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsouthey at gmail.com Fri Jun 17 10:31:37 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 17 Jun 2011 09:31:37 -0500 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: <4DFB587C.60603@gmail.com> Message-ID: <4DFB6549.2090303@gmail.com> On 06/17/2011 08:51 AM, Olivier Delalleau wrote: > > > 2011/6/17 Bruce Southey > > > On 06/17/2011 08:22 AM, gary ruben wrote: > > Thanks Olivier, > > Your suggestion gets me a little closer to what I want, but doesn't > > quite work. 
Replacing the conversion with > > > > c = lambda x:np.cast[np.complex64](complex(*eval(x))) > > b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, > > 3:c},dtype=None,delimiter=18,usecols=range(4)) > > > > produces > > > > [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) > > (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] > > [(-5.4797000885+0j) (1.85850000381-1.5501999855j) > > (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] > > [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) > > (-1.34669995308+1.65789997578j)] > > [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] > > > > which is not yet an array of complex numbers. It seems close to the > > solution though. > > > > Gary > > > > On Fri, Jun 17, 2011 at 8:40 PM, Olivier Delalleau > wrote: > >> If I understand correctly, your error is that you convert only > the second > >> column, because your converters dictionary contains a single > key (1). > >> If you have it contain keys from 0 to 3 associated to the same > function, it > >> should work. > >> > >> -=- Olivier > >> > >> 2011/6/17 gary ruben > > >>> I'm trying to read a file containing data formatted as in the > >>> following example using genfromtxt and I'm doing something > wrong. It > >>> almost works. Can someone point out my error, or suggest a simpler > >>> solution to the ugly converter function? I thought I'd leave > in the > >>> commented-out line for future reference, which I thought was a > neat > >>> way to get genfromtxt to show what it is trying to pass to the > >>> converter. > >>> > >>> import numpy as np > >>> from StringIO import StringIO > >>> > >>> a = StringIO('''\ > >>> (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) > (-1.4249, 1.7330) > >>> (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) > (-0.4805,-1.1976) > >>> ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) > (-1.3467, 1.6579) > >>> ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( > 2.5619,-3.3708) > >>> ''') > >>> > >>> #~ b = np.genfromtxt(a,converters={1:lambda > >>> x:str(x)},dtype=object,delimiter=18) > >>> b = np.genfromtxt(a,converters={1:lambda > >>> x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) > >>> > >>> print b > >>> > >>> -- > >>> > >>> This produces > >>> [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' > (-4.6027,-0.1426)', ' > >>> (-1.4249, 1.7330)') > >>> (' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' > >>> (-0.4805,-1.1976)') > >>> (' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' > (-1.3467, > >>> 1.6579)') > >>> (' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( > 2.5619,-3.3708)')] > >>> > >>> which I just need to unpack into a 4x4 array, but I get an > error if I > >>> try to apply a different view. > >>> > >>> thanks, > >>> Gary > >>> _______________________________________________ > >>> NumPy-Discussion mailing list > >>> NumPy-Discussion at scipy.org > >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > Just an observation for the StringIO object, you have multiple spaces > within the parentheses but, by default, you are using whitespace > delimiters in genfromtxt. 
So, yes, genfromtxt is going have issues. > > If you can rewrite the input, then you need a non-space and non-comma > delimiter then specify that delimiter to genfromtxt. Otherwise you are > probably going to have to write you own parser - for each line, > split on > ' (' etc. > > Bruce > > > It's funny though because that part (the parsing) actually seems to work. > > However I've been playing a bit with his example and indeed I can't > get numpy to return a complex array. It keeps resulting in an "object" > dtype. > The only way I found was to convert it temporarily into a list to > recast it in complex64, adding the line: > b = np.array(map(list, b), dtype=np.complex64) > > -=- Olivier > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion Yes, my mistake as I did not see that the delimiter was set to 18. The problem is about the conversion into complex which I do not think is the converter per se. So a brute force way is: b= np.genfromtxt(a, dtype=str,delimiter=18) nrow, ncol=b.shape print b d=np.empty((nrow, ncol), dtype=np.complex) for row in range(nrow): for col in range(ncol): d[row][col]=complex(*eval(b[row][col])) print d Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From gruben at bigpond.net.au Fri Jun 17 11:39:39 2011 From: gruben at bigpond.net.au (gary ruben) Date: Sat, 18 Jun 2011 01:39:39 +1000 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: <4DFB6549.2090303@gmail.com> References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> Message-ID: Thanks for the hints Olivier and Bruce. Based on them, the following is a working solution, although I still have that itchy sense that genfromtxt should be able to do it directly. import numpy as np from StringIO import StringIO a = StringIO('''\ (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) ''') b = np.genfromtxt(a, dtype=str, delimiter=18)[:,:-1] b = np.vectorize(lambda x: complex(*eval(x)))(b) print b On Sat, Jun 18, 2011 at 12:31 AM, Bruce Southey wrote: > On 06/17/2011 08:51 AM, Olivier Delalleau wrote: > > 2011/6/17 Bruce Southey >> >> On 06/17/2011 08:22 AM, gary ruben wrote: >> > Thanks Olivier, >> > Your suggestion gets me a little closer to what I want, but doesn't >> > quite work. Replacing the conversion with >> > >> > c = lambda x:np.cast[np.complex64](complex(*eval(x))) >> > b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, >> > 3:c},dtype=None,delimiter=18,usecols=range(4)) >> > >> > produces >> > >> > [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) >> > ? (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] >> > ? [(-5.4797000885+0j) (1.85850000381-1.5501999855j) >> > ? (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] >> > ? [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) >> > ? (-1.34669995308+1.65789997578j)] >> > ? [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] >> > >> > which is not yet an array of complex numbers. It seems close to the >> > solution though. 
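A small variant of gary's working solution above that avoids eval by splitting the "(re,im)" strings directly (a sketch: it keeps the dtype=str / delimiter=18 read and the [:, :-1] slice unchanged, and 'a' is the StringIO object from the thread, re-created since genfromtxt consumes it):

import numpy as np

def parse_pair(s):
    # ' (-3.9700,-5.0400)' -> (-3.97-5.04j), without eval
    re_part, im_part = s.strip().strip('()').split(',')
    return complex(float(re_part), float(im_part))

b = np.genfromtxt(a, dtype=str, delimiter=18)[:, :-1]
b = np.vectorize(parse_pair)(b)    # 4x4 complex128 array
print b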
>> > >> > Gary >> > >> > On Fri, Jun 17, 2011 at 8:40 PM, Olivier Delalleau >> > ?wrote: >> >> If I understand correctly, your error is that you convert only the >> >> second >> >> column, because your converters dictionary contains a single key (1). >> >> If you have it contain keys from 0 to 3 associated to the same >> >> function, it >> >> should work. >> >> >> >> -=- Olivier >> >> >> >> 2011/6/17 gary ruben >> >>> I'm trying to read a file containing data formatted as in the >> >>> following example using genfromtxt and I'm doing something wrong. It >> >>> almost works. Can someone point out my error, or suggest a simpler >> >>> solution to the ugly converter function? I thought I'd leave in the >> >>> commented-out line for future reference, which I thought was a neat >> >>> way to get genfromtxt to show what it is trying to pass to the >> >>> converter. >> >>> >> >>> import numpy as np >> >>> from StringIO import StringIO >> >>> >> >>> a = StringIO('''\ >> >>> ? (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, >> >>> 1.7330) >> >>> ? (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) >> >>> (-0.4805,-1.1976) >> >>> ? ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, >> >>> 1.6579) >> >>> ? ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( >> >>> 2.5619,-3.3708) >> >>> ''') >> >>> >> >>> #~ b = np.genfromtxt(a,converters={1:lambda >> >>> x:str(x)},dtype=object,delimiter=18) >> >>> b = np.genfromtxt(a,converters={1:lambda >> >>> x:complex(*eval(x))},dtype=None,delimiter=18,usecols=range(4)) >> >>> >> >>> print b >> >>> >> >>> -- >> >>> >> >>> This produces >> >>> [ (' (-3.9700,-5.0400)', (-1.1318-2.5693j), ' (-4.6027,-0.1426)', ' >> >>> (-1.4249, 1.7330)') >> >>> ? (' (-5.4797, 0.0000)', (1.8585-1.5502j), ' ( 4.4145,-0.7638)', ' >> >>> (-0.4805,-1.1976)') >> >>> ? (' ( 0.0000, 0.0000)', (6.2673+0j), ' (-0.4504,-0.0290)', ' >> >>> (-1.3467, >> >>> 1.6579)') >> >>> ? (' ( 0.0000, 0.0000)', 0j, ' (-3.5000, 0.0000)', ' ( >> >>> 2.5619,-3.3708)')] >> >>> >> >>> which I just need to unpack into a 4x4 array, but I get an error if I >> >>> try to apply a different view. >> >>> >> >>> thanks, >> >>> Gary >> >>> _______________________________________________ >> >>> NumPy-Discussion mailing list >> >>> NumPy-Discussion at scipy.org >> >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >> >> _______________________________________________ >> >> NumPy-Discussion mailing list >> >> NumPy-Discussion at scipy.org >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >> >> >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> Just an observation for the StringIO object, you have multiple spaces >> within the parentheses but, by default, you are using whitespace >> delimiters in genfromtxt. So, yes, genfromtxt is going have issues. >> >> If you can rewrite the input, then you need a non-space and non-comma >> delimiter then specify that delimiter to genfromtxt. Otherwise you are >> probably going to have to write you own parser - for each line, split on >> ' (' etc. >> >> Bruce > > It's funny though because that part (the parsing) actually seems to work. > > However I've been playing a bit with his example and indeed I can't get > numpy to return a complex array. It keeps resulting in an "object" dtype. 
> The only way I found was to convert it temporarily into a list to recast it > in complex64, adding the line: > ? b = np.array(map(list, b), dtype=np.complex64) > > -=- Olivier > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > Yes, my mistake as I did not see that the delimiter was set to 18. > The problem is about the conversion into complex which I do not think is the > converter per se. So a brute force way is: > > b= np.genfromtxt(a, dtype=str,delimiter=18) > nrow, ncol=b.shape > print b > d=np.empty((nrow, ncol), dtype=np.complex) > for row in range(nrow): > ??? for col in range(ncol): > ??????? d[row][col]=complex(*eval(b[row][col])) > print d > > > Bruce > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From gokhansever at gmail.com Fri Jun 17 12:48:30 2011 From: gokhansever at gmail.com (=?UTF-8?Q?G=C3=B6khan_Sever?=) Date: Fri, 17 Jun 2011 10:48:30 -0600 Subject: [Numpy-discussion] Unicode characters in a numpy array In-Reply-To: References: Message-ID: On Thu, Jun 16, 2011 at 8:54 PM, Charles R Harris wrote: > > > On Wed, Jun 15, 2011 at 1:30 PM, G?khan Sever wrote: >> >> Hello, >> >> The following snippet works fine for a regular string and prints out >> the string without a problem. >> >> python >> Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) >> [GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on linux2 >> Type "help", "copyright", "credits" or "license" for more information. >> >>> mystr = u"??????" >> >>> mystr >> u'\xf6\xf6\xf6\u011f\u011f\u011f' >> >>> type(mystr) >> >> >>> print mystr >> ?????? >> >> What is the correct way to print out the following array? >> >> >>> import numpy as np >> >>> arr = np.array(u"??????") >> >>> arr >> array(u'\xf6\xf6\xf6\u011f\u011f\u011f', >> ? ? ?dtype='> >>> print arr >> Traceback (most recent call last): >> ?File "", line 1, in >> ?File "/usr/lib64/python2.7/site-packages/numpy/core/numeric.py", >> line 1379, in array_str >> ? ?return array2string(a, max_line_width, precision, suppress_small, >> ' ', "", str) >> ?File "/usr/lib64/python2.7/site-packages/numpy/core/arrayprint.py", >> line 426, in array2string >> ? ?lst = style(x) >> UnicodeEncodeError: 'ascii' codec can't encode characters in position >> 0-5: ordinal not in range(128) >> > > I don't know. It might be that we need to fix the printing functions for > unicode and maybe have some way to set the codec as well. > > Chuck > Typing arr = np.array(u"??????") yields UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-22: ordinal not in range(128) in IPython 0.10. I am not sure if this is fixed in the new-coming IPython. Typing the array in this form (with brackets) makes a difference: >>> arr = np.array([u"??????"]) >>> print arr [u'\xf6\xf6\xf6\u011f\u011f\u011f'] >>> print arr[0] ?????? I am wondering whether "print arr" should print out the unicode characters in human-readable format or in this current form. This applies to the regular Python lists as well. >>> mylist = [u"??????"] >>> print mylist [u'\xf6\xf6\xf6\u011f\u011f\u011f'] >>> print mylist[0] ?????? From ben.root at ou.edu Fri Jun 17 12:56:59 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 17 Jun 2011 11:56:59 -0500 Subject: [Numpy-discussion] unwrap for masked arrays? Message-ID: It does not appear that unwrap works properly for masked arrays. 
First, it uses np.asarray() at the start of the function. However, that alone would not fix the problem given the nature of how unwrap works (performing diff operations). I tried a slightly modified version of unwrap, but I could not get it to always work properly. Anybody know of an alternative or a work-around? Thanks, Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Fri Jun 17 13:38:22 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Fri, 17 Jun 2011 10:38:22 -0700 Subject: [Numpy-discussion] get a dtypes' range? Message-ID: <4DFB910E.5080603@noaa.gov> Hi all, I'm wondering if there is a way to get the range of values a given dtype can hold? essentially, I'm looking for something like sys.maxint, for for whatever numpy dtype I have n hand (and eps for floating point types). I was hoping there would be something like a_dtype.max_value a_dtype.min_value etc. In this case, I have a uint16, so I can hard code it, but it would be nice to be able to be write the code in a more generic fashion. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From silva at lma.cnrs-mrs.fr Fri Jun 17 14:01:53 2011 From: silva at lma.cnrs-mrs.fr (Fabrice Silva) Date: Fri, 17 Jun 2011 20:01:53 +0200 Subject: [Numpy-discussion] get a dtypes' range? In-Reply-To: <4DFB910E.5080603@noaa.gov> References: <4DFB910E.5080603@noaa.gov> Message-ID: <1308333713.13504.0.camel@amilo.coursju> Le vendredi 17 juin 2011 ? 10:38 -0700, Christopher Barker a ?crit : > Hi all, > > I'm wondering if there is a way to get the range of values a given dtype > can hold? > > essentially, I'm looking for something like sys.maxint, for for whatever > numpy dtype I have n hand (and eps for floating point types). > > I was hoping there would be something like > > a_dtype.max_value > a_dtype.min_value > > etc. > > In this case, I have a uint16, so I can hard code it, but it would be > nice to be able to be write the code in a more generic fashion. 
http://docs.scipy.org/doc/numpy/reference/routines.dtype.html#data-type-information -- Fabrice Silva From mwwiebe at gmail.com Fri Jun 17 14:05:10 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 17 Jun 2011 13:05:10 -0500 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: <91FE4549-74FE-421A-A1B7-702B63C500E9@astro.physik.uni-goettingen.de> References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> <91FE4549-74FE-421A-A1B7-702B63C500E9@astro.physik.uni-goettingen.de> Message-ID: On Thu, Jun 16, 2011 at 8:18 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > On 17.06.2011, at 2:02AM, Mark Wiebe wrote: > > >> ok, that was a lengthy hunt, but it's in printing the string in > make_iso_8601_date: > >> > >> tmplen = snprintf(substr, sublen, "%04" NPY_INT64_FMT, dts->year); > >> fprintf(stderr, "printed %d[%d]: dts->year=%lld: %s\n", tmplen, > sublen, dts->year, substr); > >> > >> produces > >> > >> >>> np.datetime64('1970-03-23 20:00:00Z', 'D') > >> printed 4[62]: dts->year=1970: 0000 > >> numpy.datetime64('0000-03-23','D') > >> > >> It seems snprintf is not using the correct format for INT64 (as I > happened to do in fprintf before > >> realising I had to use "%lld" ;-) - could it be this is a general issue, > which just does not show up > >> on little-endian machines because they happen to pass the right half of > the int64 to printf? > >> BTW, how is this supposed to be handled (in 4 digits) if the year is > indeed beyond the 32bit range > >> (i.e. >~ 0.3 Hubble times...)? Just wondering if one could simply cast > it to int32 before print. > >> > > I'd prefer to fix the NPY_INT64_FMT macro. There's no point in having it > if it doesn't work... What is NumPy setting it to for that platform? > > > Of course (just felt somewhat lost among all the #defines). It clearly > seems to be mis-constructed > on PowerPC 32: > NPY_SIZEOF_LONG is 4, thus NPY_INT64_FMT is set to NPY_LONGLONG_FMT - "Ld", > but this does not seem to handle int64 on big-endian Macs - explicitly > printing "%Ld", dts->year > also produces 0. > Changing the snprintf format to "%04" "lld" produces the correct output, so > if nothing else > avails, I suggest to put something like > > # elseif (defined(__ppc__) || defined(__ppc64__)) > #define LONGLONG_FMT "lld" > #define ULONGLONG_FMT "llu" > # else > > into npy_common.h (or possibly simply "defined(__APPLE__)", since %lld > seems to > work on 32bit i386 Macs just as well). > Probably a minimally invasive change is best, also this kind of thing deserves a comment explaining the problem that was encountered with the specific platforms, so that in the future when people examine this part they can understand why this is there. Do you want to make a pull request for this change? -Mark > > Cheers, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From efiring at hawaii.edu Fri Jun 17 14:26:03 2011 From: efiring at hawaii.edu (Eric Firing) Date: Fri, 17 Jun 2011 08:26:03 -1000 Subject: [Numpy-discussion] unwrap for masked arrays? In-Reply-To: References: Message-ID: <4DFB9C3B.6080602@hawaii.edu> On 06/17/2011 06:56 AM, Benjamin Root wrote: > It does not appear that unwrap works properly for masked arrays. 
First, > it uses np.asarray() at the start of the function. However, that alone > would not fix the problem given the nature of how unwrap works > (performing diff operations). I tried a slightly modified version of > unwrap, but I could not get it to always work properly. Anybody know of > an alternative or a work-around? http://currents.soest.hawaii.edu/hgstage/pycurrents/file/08f9137a2a08/data/navcalc.py Eric > > Thanks, > Ben Root > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From Chris.Barker at noaa.gov Fri Jun 17 14:27:07 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Fri, 17 Jun 2011 11:27:07 -0700 Subject: [Numpy-discussion] get a dtypes' range? In-Reply-To: <1308333713.13504.0.camel@amilo.coursju> References: <4DFB910E.5080603@noaa.gov> <1308333713.13504.0.camel@amilo.coursju> Message-ID: <4DFB9C7B.7090408@noaa.gov> Fabrice Silva wrote: >> I'm wondering if there is a way to get the range of values a given dtype >> can hold? > http://docs.scipy.org/doc/numpy/reference/routines.dtype.html#data-type-information yup, that does it, I hadn't found that, as I was expecting a more OO way, i.e. as a method of the dtype. Also, I need to know in advance if it's a int or a float, but I suppose you can't really do much without knowing that anyway. Actually, I'm a bit confused about dtypes from an OO design perspective anyway. I note that the dtypes seem to have all (most?) of the methods of ndarrays (or placeholders, anyway), which I don't quite get. oh well, I've found what I need, thanks. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From robert.kern at gmail.com Fri Jun 17 14:34:19 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 17 Jun 2011 13:34:19 -0500 Subject: [Numpy-discussion] get a dtypes' range? In-Reply-To: <4DFB9C7B.7090408@noaa.gov> References: <4DFB910E.5080603@noaa.gov> <1308333713.13504.0.camel@amilo.coursju> <4DFB9C7B.7090408@noaa.gov> Message-ID: On Fri, Jun 17, 2011 at 13:27, Christopher Barker wrote: > Actually, I'm a bit confused about dtypes from an OO design perspective > anyway. I note that the dtypes seem to have all (most?) of the methods > of ndarrays (or placeholders, anyway), which I don't quite get. No, they don't. [~] |33> d = np.dtype(float) [~] |34> dir(d) ['__class__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'alignment', 'base', 'byteorder', 'char', 'descr', 'fields', 'flags', 'hasobject', 'isbuiltin', 'isnative', 'itemsize', 'kind', 'metadata', 'name', 'names', 'newbyteorder', 'num', 'shape', 'str', 'subdtype', 'type'] The numpy *scalar* types do, because they are actual Python types just like any other. They provide the *unbound* methods (i.e. you cannot call them) that will be bound when an instance of it gets created. And the scalar types have many of the same methods as ndarray to allow some ability to write generic functions that will work with either arrays or scalars. 
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From bsouthey at gmail.com Fri Jun 17 15:02:40 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 17 Jun 2011 14:02:40 -0500 Subject: [Numpy-discussion] get a dtypes' range? In-Reply-To: <4DFB910E.5080603@noaa.gov> References: <4DFB910E.5080603@noaa.gov> Message-ID: <4DFBA4D0.8000601@gmail.com> On 06/17/2011 12:38 PM, Christopher Barker wrote: > Hi all, > > I'm wondering if there is a way to get the range of values a given dtype > can hold? > > essentially, I'm looking for something like sys.maxint, for for whatever > numpy dtype I have n hand (and eps for floating point types). > > I was hoping there would be something like > > a_dtype.max_value > a_dtype.min_value > > etc. > > In this case, I have a uint16, so I can hard code it, but it would be > nice to be able to be write the code in a more generic fashion. > > -Chris > Do you mean np.finfo() and np.iinfo()? >>> print np.finfo(float) Machine parameters for float64 --------------------------------------------------------------------- precision= 15 resolution= 1.0000000000000001e-15 machep= -52 eps= 2.2204460492503131e-16 negep = -53 epsneg= 1.1102230246251565e-16 minexp= -1022 tiny= 2.2250738585072014e-308 maxexp= 1024 max= 1.7976931348623157e+308 nexp = 11 min= -max --------------------------------------------------------------------- >>> print np.iinfo(int) Machine parameters for int64 --------------------------------------------------------------------- min = -9223372036854775808 max = 9223372036854775807 --------------------------------------------------------------------- >>> sys.maxint 9223372036854775807 Bruce From Chris.Barker at noaa.gov Fri Jun 17 15:26:10 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Fri, 17 Jun 2011 12:26:10 -0700 Subject: [Numpy-discussion] get a dtypes' range? In-Reply-To: References: <4DFB910E.5080603@noaa.gov> <1308333713.13504.0.camel@amilo.coursju> <4DFB9C7B.7090408@noaa.gov> Message-ID: <4DFBAA52.2080609@noaa.gov> Robert Kern wrote: > On Fri, Jun 17, 2011 at 13:27, Christopher Barker wrote: > >> Actually, I'm a bit confused about dtypes from an OO design perspective >> anyway. I note that the dtypes seem to have all (most?) of the methods >> of ndarrays (or placeholders, anyway), which I don't quite get. > > No, they don't. |33> d = np.dtype(float) [~] |34> dir(d)... > The numpy *scalar* types do, OK -- that is, indeed, what I observed, from: dir(np.uint16) ... > because they are actual Python types just > like any other. They provide the *unbound* methods (i.e. you cannot > call them) that will be bound when an instance of it gets created. And > the scalar types have many of the same methods as ndarray to allow > some ability to write generic functions that will work with either > arrays or scalars. That's handy. I guess I'm confused about the difference between: np.dtype(uint16) and np.uint16 since I always pass in the later when a dtype is called for. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From ben.root at ou.edu Fri Jun 17 15:26:22 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 17 Jun 2011 14:26:22 -0500 Subject: [Numpy-discussion] unwrap for masked arrays? 
In-Reply-To: <4DFB9C3B.6080602@hawaii.edu> References: <4DFB9C3B.6080602@hawaii.edu> Message-ID: On Fri, Jun 17, 2011 at 1:26 PM, Eric Firing wrote: > On 06/17/2011 06:56 AM, Benjamin Root wrote: > > It does not appear that unwrap works properly for masked arrays. First, > > it uses np.asarray() at the start of the function. However, that alone > > would not fix the problem given the nature of how unwrap works > > (performing diff operations). I tried a slightly modified version of > > unwrap, but I could not get it to always work properly. Anybody know of > > an alternative or a work-around? > > > http://currents.soest.hawaii.edu/hgstage/pycurrents/file/08f9137a2a08/data/navcalc.py > > Eric > Eric, Are you looking over my shoulder or something? This module is exactly what I needed. Need some tweaks to support a 2-D array so that all 1-D parts get the same amount of unwrapping, but it is pretty much what I needed. Thanks! Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Fri Jun 17 15:46:42 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Fri, 17 Jun 2011 21:46:42 +0200 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> <91FE4549-74FE-421A-A1B7-702B63C500E9@astro.physik.uni-goettingen.de> Message-ID: <22E40E5D-E0EF-41B5-8FDD-B08B64BBC553@astro.physik.uni-goettingen.de> On 17.06.2011, at 8:05PM, Mark Wiebe wrote: > On Thu, Jun 16, 2011 at 8:18 PM, Derek Homeier wrote: >> On 17.06.2011, at 2:02AM, Mark Wiebe wrote: >> >> >> ok, that was a lengthy hunt, but it's in printing the string in make_iso_8601_date: >> >> >> >> tmplen = snprintf(substr, sublen, "%04" NPY_INT64_FMT, dts->year); >> >> fprintf(stderr, "printed %d[%d]: dts->year=%lld: %s\n", tmplen, sublen, dts->year, substr); >> >> >> >> produces >> >> >> >> >>> np.datetime64('1970-03-23 20:00:00Z', 'D') >> >> printed 4[62]: dts->year=1970: 0000 >> >> numpy.datetime64('0000-03-23','D') >> >> >> >> It seems snprintf is not using the correct format for INT64 (as I happened to do in fprintf before >> >> realising I had to use "%lld" ;-) - could it be this is a general issue, which just does not show up >> >> on little-endian machines because they happen to pass the right half of the int64 to printf? >> >> BTW, how is this supposed to be handled (in 4 digits) if the year is indeed beyond the 32bit range >> >> (i.e. >~ 0.3 Hubble times...)? Just wondering if one could simply cast it to int32 before print. >> >> >> > I'd prefer to fix the NPY_INT64_FMT macro. There's no point in having it if it doesn't work... What is NumPy setting it to for that platform? >> > >> Of course (just felt somewhat lost among all the #defines). It clearly seems to be mis-constructed >> on PowerPC 32: >> NPY_SIZEOF_LONG is 4, thus NPY_INT64_FMT is set to NPY_LONGLONG_FMT - "Ld", >> but this does not seem to handle int64 on big-endian Macs - explicitly printing "%Ld", dts->year >> also produces 0. 
>> Changing the snprintf format to "%04" "lld" produces the correct output, so if nothing else >> avails, I suggest to put something like >> >> # elseif (defined(__ppc__) || defined(__ppc64__)) >> #define LONGLONG_FMT "lld" >> #define ULONGLONG_FMT "llu" >> # else >> >> into npy_common.h (or possibly simply "defined(__APPLE__)", since %lld seems to >> work on 32bit i386 Macs just as well). >> > Probably a minimally invasive change is best, also this kind of thing deserves a comment explaining the problem that was encountered with the specific platforms, so that in the future when people examine this part they can understand why this is there. Do you want to make a pull request for this change? > I'd go with the defined(__APPLE__) then, since %Ld produces wrong results on both 32bit platforms. More precisely, this print "%Ld - %Ld", dts->year, dts->year produces "0 - 1970" on ppc and "1970 - 0" on i386, while "%lld - %lld" prints "1970 - 1970" on both archs. There still is an issue (I now remember this came up with a different test a few months ago), that none of the formats seems to be able to actually print numbers > 2**32 (or 2**31, don't remember), but this seemed out of reach for anyone on this list. Cheers, Derek From ben.root at ou.edu Fri Jun 17 16:03:53 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 17 Jun 2011 15:03:53 -0500 Subject: [Numpy-discussion] crash in multiarray Message-ID: Using the master branch, I was running the scipy tests when a crash occurred. I believe the crash originates within numpy. The tests for numpy do pass on my machine (F15, x86_64, python 2.7). Below is the backtrace: *** glibc detected *** python: double free or corruption (!prev): 0x000000000524a450 *** ======= Backtrace: ========= /lib64/libc.so.6[0x35dae76fea] /home/bvr/Programs/numpy/numpy/core/multiarray.so(+0x93dd6)[0x7fe12001bdd6] /usr/lib64/libpython2.7.so.1.0[0x35eca7df0f] /usr/lib64/libpython2.7.so.1.0[0x35eca9af9c] /usr/lib64/libpython2.7.so.1.0[0x35eca6c242] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x53a)[0x35ecae0e0a] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5556)[0x35ecadfa86] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6d9ac] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1340)[0x35ecadb870] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5b68)[0x35ecae0098] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6daa3] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1340)[0x35ecadb870] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6d9ac] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca57b6f] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca9e074] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3b73)[0x35ecade0a3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5b68)[0x35ecae0098] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6daa3] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1340)[0x35ecadb870] 
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6d9ac] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca57b6f] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca9e074] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3b73)[0x35ecade0a3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6daa3] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1340)[0x35ecadb870] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6d9ac] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca57b6f] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca9e074] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3b73)[0x35ecade0a3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6daa3] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1340)[0x35ecadb870] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6d9ac] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca57b6f] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0[0x35eca9e074] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x53)[0x35eca48fe3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3b73)[0x35ecade0a3] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x855)[0x35ecae1125] /usr/lib64/libpython2.7.so.1.0[0x35eca6daa3] Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Fri Jun 17 16:07:51 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 17 Jun 2011 15:07:51 -0500 Subject: [Numpy-discussion] crash in multiarray In-Reply-To: References: Message-ID: On Fri, Jun 17, 2011 at 15:03, Benjamin Root wrote: > Using the master branch, I was running the scipy tests when a crash > occurred.? I believe the crash originates within numpy.? The tests for numpy > do pass on my machine (F15, x86_64, python 2.7).? Below is the backtrace: Please run "scipy.test(verbose=2)" in order to determine what test is triggering the crash. Even better, if you can install faulthandler and run the tests under that, you can give us a full Python traceback where the segfault occurs. http://pypi.python.org/pypi/faulthandler -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Fri Jun 17 16:10:37 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 17 Jun 2011 15:10:37 -0500 Subject: [Numpy-discussion] get a dtypes' range? 
In-Reply-To: <4DFBAA52.2080609@noaa.gov> References: <4DFB910E.5080603@noaa.gov> <1308333713.13504.0.camel@amilo.coursju> <4DFB9C7B.7090408@noaa.gov> <4DFBAA52.2080609@noaa.gov> Message-ID: On Fri, Jun 17, 2011 at 14:26, Christopher Barker wrote: > I guess I'm confused about the difference between: > > np.dtype(uint16) > > and > > np.uint16 > > since I always pass in the later when a dtype is called for. Whenever a dtype= argument is present, we also accept anything that can be converted to a dtype via np.dtype(x). The following are all equivalent: dtype=float dtype='float' dtype='float64' dtype=np.float64 dtype='f8' dtype='d' dtype=np.dtype(float) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From ben.root at ou.edu Fri Jun 17 16:18:54 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 17 Jun 2011 15:18:54 -0500 Subject: [Numpy-discussion] crash in multiarray In-Reply-To: References: Message-ID: On Fri, Jun 17, 2011 at 3:07 PM, Robert Kern wrote: > On Fri, Jun 17, 2011 at 15:03, Benjamin Root wrote: > > Using the master branch, I was running the scipy tests when a crash > > occurred. I believe the crash originates within numpy. The tests for > numpy > > do pass on my machine (F15, x86_64, python 2.7). Below is the backtrace: > > Please run "scipy.test(verbose=2)" in order to determine what test is > triggering the crash. > > Even better, if you can install faulthandler and run the tests under > that, you can give us a full Python traceback where the segfault > occurs. > > http://pypi.python.org/pypi/faulthandler > > Cool module. Ok, looks like the failure is triggered here: File "/home/bvr/Programs/scipy/scipy/sparse/linalg/eigen/arpack/tests/test_arpack.py", line 378 in test_ticket_1459_arpack_crash Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Fri Jun 17 16:39:21 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 17 Jun 2011 15:39:21 -0500 Subject: [Numpy-discussion] crash in multiarray In-Reply-To: References: Message-ID: On Fri, Jun 17, 2011 at 15:18, Benjamin Root wrote: > On Fri, Jun 17, 2011 at 3:07 PM, Robert Kern wrote: >> >> On Fri, Jun 17, 2011 at 15:03, Benjamin Root wrote: >> > Using the master branch, I was running the scipy tests when a crash >> > occurred.? I believe the crash originates within numpy.? The tests for >> > numpy >> > do pass on my machine (F15, x86_64, python 2.7).? Below is the >> > backtrace: >> >> Please run "scipy.test(verbose=2)" in order to determine what test is >> triggering the crash. >> >> Even better, if you can install faulthandler and run the tests under >> that, you can give us a full Python traceback where the segfault >> occurs. >> >> ?http://pypi.python.org/pypi/faulthandler >> > > Cool module.? Ok, looks like the failure is triggered here: > > File > "/home/bvr/Programs/scipy/scipy/sparse/linalg/eigen/arpack/tests/test_arpack.py", > line 378 in test_ticket_1459_arpack_crash Well, if it's testing that it crashes, test passed! 
Here is the referenced ticket and the relevant commits: http://projects.scipy.org/scipy/ticket/1459 https://github.com/pv/scipy-work/commit/ee92958b24cabb77b90b2426653d133ba956edd2 https://github.com/pv/scipy-work/commit/46aca8abdcdd4706945dd513f949400e903585f8 -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From pav at iki.fi Fri Jun 17 16:48:48 2011 From: pav at iki.fi (Pauli Virtanen) Date: Fri, 17 Jun 2011 20:48:48 +0000 (UTC) Subject: [Numpy-discussion] crash in multiarray References: Message-ID: On Fri, 17 Jun 2011 15:39:21 -0500, Robert Kern wrote: [clip] >> File >> "/home/bvr/Programs/scipy/scipy/sparse/linalg/eigen/arpack/tests/test_arpack.py", >> line 378 in test_ticket_1459_arpack_crash > > Well, if it's testing that it crashes, test passed! > > Here is the referenced ticket and the relevant commits: > > http://projects.scipy.org/scipy/ticket/1459 rm -rf build and try again. distutils does not rebuild the module even when some files have changed. Pauli From derek at astro.physik.uni-goettingen.de Fri Jun 17 16:06:04 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Fri, 17 Jun 2011 22:06:04 +0200 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> Message-ID: Hi Gary, On 17.06.2011, at 5:39PM, gary ruben wrote: > Thanks for the hints Olivier and Bruce. Based on them, the following > is a working solution, although I still have that itchy sense that genfromtxt > should be able to do it directly. > > import numpy as np > from StringIO import StringIO > > a = StringIO('''\ > (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) > (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) > ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) > ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) > ''') > > b = np.genfromtxt(a, dtype=str, delimiter=18)[:,:-1] > b = np.vectorize(lambda x: complex(*eval(x)))(b) > > print b It should, I think you were very close in your earlier attempt: > On Sat, Jun 18, 2011 at 12:31 AM, Bruce Southey wrote: >> On 06/17/2011 08:51 AM, Olivier Delalleau wrote: >> >> 2011/6/17 Bruce Southey >>> >>> On 06/17/2011 08:22 AM, gary ruben wrote: >>>> Thanks Olivier, >>>> Your suggestion gets me a little closer to what I want, but doesn't >>>> quite work. Replacing the conversion with >>>> >>>> c = lambda x:np.cast[np.complex64](complex(*eval(x))) >>>> b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, >>>> 3:c},dtype=None,delimiter=18,usecols=range(4)) >>>> >>>> produces >>>> >>>> [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) >>>> (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] >>>> [(-5.4797000885+0j) (1.85850000381-1.5501999855j) >>>> (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] >>>> [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) >>>> (-1.34669995308+1.65789997578j)] >>>> [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] >>>> >>>> which is not yet an array of complex numbers. It seems close to the >>>> solution though. 
You were just overdoing it by already creating an array with the converter, this apparently caused genfromtxt to create a structured array from the input (which could be converted back to an ndarray, but that can prove tricky as well) - similar, if you omit the dtype=None. The following cnv = dict.fromkeys(range(4), lambda x: complex(*eval(x))) b = np.genfromtxt(a,converters=cnv, dtype=None, delimiter=18, usecols=range(4)) directly produces a shape(4,4) complex array for me (you may have to apply an .astype(np.complex64) afterwards if so desired). BTW I think this is an interesting enough case of reading non-trivially structured data that it deserves to appear on some examples or cookbook page. HTH, Derek From ralf.gommers at googlemail.com Fri Jun 17 17:00:09 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Fri, 17 Jun 2011 23:00:09 +0200 Subject: [Numpy-discussion] numpy build: automatic fortran detection In-Reply-To: References: Message-ID: On Mon, Jun 13, 2011 at 8:53 PM, Russell E. Owen wrote: > In article , > Ralf Gommers wrote: > > > On Thu, Jun 9, 2011 at 11:46 PM, Russell E. Owen wrote: > > > > > What would it take to automatically detect which flavor of fortran to > > > use to build numpy on linux? > > > > > > > You want to figure out which compiler was used to build BLAS/LAPACK/ATLAS > > and check that the numpy build uses the same, right? Assuming you only > care > > about g77/gfortran, you can try this (from > > http://docs.scipy.org/doc/numpy/user/install.html): > > """ > > How to check the ABI of blas/lapack/atlas > > ----------------------------------------- > > One relatively simple and reliable way to check for the compiler used to > > build a library is to use ldd on the library. If libg2c.so is a > dependency, > > this means that g77 has been used. If libgfortran.so is a a dependency, > > gfortran has been used. If both are dependencies, this means both have > been > > used, which is almost always a very bad idea. > > """ > > > > You could do something similar for other compilers if needed. It would > help > > to know exactly what problem you are trying to solve. > > I'm trying to automate the process of figuring out which fortran to use > because we have a build system that should install numpy for users, > without user interaction. > I assume you're installing scipy as well? There's no Fortran code in numpy. > > We found this out the hard way by not specifying a compiler and ending > up with a numpy that was mis-built (according to its helpful unit tests). > > However, it appears that g77 is very old so I'm now wondering if it > would make sense to switch to gfortran for the default? > This makes sense. Only if you're building on Windows and build with MinGW you have to give it some more thought - we're still building the distributed binaries with g77. > > I think our own procedure will be to assume gfortran and complain to > users if it's not right. (We can afford the time to run the test). > > > > The unit tests are clever enough to detect a mis-build (though > > > surprisingly that is not done as part of the build process), so surely > > > it can be done. > > > > > > > The test suite takes some time to run. It would be very annoying if it > ran > > by default on every rebuild. It's easy to write a build script that > builds > > numpy, then runs the tests, if that's what you need. > > In this case the only relevant test is the test for the correct fortran > compiler. That is the only test I was proposing be performed. 
> > There's not a single test like that in the test suite, unless I missed something. But you can easily implement it for gfortran/g77. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From shish at keba.be Fri Jun 17 17:01:45 2011 From: shish at keba.be (Olivier Delalleau) Date: Fri, 17 Jun 2011 17:01:45 -0400 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> Message-ID: 2011/6/17 Derek Homeier > Hi Gary, > > On 17.06.2011, at 5:39PM, gary ruben wrote: > > Thanks for the hints Olivier and Bruce. Based on them, the following > > is a working solution, although I still have that itchy sense that > genfromtxt > > should be able to do it directly. > > > > import numpy as np > > from StringIO import StringIO > > > > a = StringIO('''\ > > (-3.9700,-5.0400) (-1.1318,-2.5693) (-4.6027,-0.1426) (-1.4249, 1.7330) > > (-5.4797, 0.0000) ( 1.8585,-1.5502) ( 4.4145,-0.7638) (-0.4805,-1.1976) > > ( 0.0000, 0.0000) ( 6.2673, 0.0000) (-0.4504,-0.0290) (-1.3467, 1.6579) > > ( 0.0000, 0.0000) ( 0.0000, 0.0000) (-3.5000, 0.0000) ( 2.5619,-3.3708) > > ''') > > > > b = np.genfromtxt(a, dtype=str, delimiter=18)[:,:-1] > > b = np.vectorize(lambda x: complex(*eval(x)))(b) > > > > print b > > It should, I think you were very close in your earlier attempt: > > > On Sat, Jun 18, 2011 at 12:31 AM, Bruce Southey > wrote: > >> On 06/17/2011 08:51 AM, Olivier Delalleau wrote: > >> > >> 2011/6/17 Bruce Southey > >>> > >>> On 06/17/2011 08:22 AM, gary ruben wrote: > >>>> Thanks Olivier, > >>>> Your suggestion gets me a little closer to what I want, but doesn't > >>>> quite work. Replacing the conversion with > >>>> > >>>> c = lambda x:np.cast[np.complex64](complex(*eval(x))) > >>>> b = np.genfromtxt(a,converters={0:c, 1:c, 2:c, > >>>> 3:c},dtype=None,delimiter=18,usecols=range(4)) > >>>> > >>>> produces > >>>> > >>>> [[(-3.97000002861-5.03999996185j) (-1.1318000555-2.56929993629j) > >>>> (-4.60270023346-0.142599999905j) (-1.42490005493+1.73300004005j)] > >>>> [(-5.4797000885+0j) (1.85850000381-1.5501999855j) > >>>> (4.41450023651-0.763800024986j) (-0.480500012636-1.19760000706j)] > >>>> [0j (6.26730012894+0j) (-0.45039999485-0.0289999991655j) > >>>> (-1.34669995308+1.65789997578j)] > >>>> [0j 0j (-3.5+0j) (2.56189990044-3.37080001831j)]] > >>>> > >>>> which is not yet an array of complex numbers. It seems close to the > >>>> solution though. > > You were just overdoing it by already creating an array with the converter, > this apparently caused genfromtxt to create a structured array from the > input (which could be converted back to an ndarray, but that can prove > tricky as well) - similar, if you omit the dtype=None. The following > > cnv = dict.fromkeys(range(4), lambda x: complex(*eval(x))) > b = np.genfromtxt(a,converters=cnv, dtype=None, delimiter=18, > usecols=range(4)) > > directly produces a shape(4,4) complex array for me (you may have to apply > an .astype(np.complex64) afterwards if so desired). > > BTW I think this is an interesting enough case of reading non-trivially > structured data that it deserves to appear on some examples or cookbook > page. > > HTH, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > I had tried that as well and it doesn't work with numpy 1.4.1 (I get an object array). It may have been fixed in a later version. 
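If the converters do run on 1.4.x and you are merely left with an object array, a plain cast may already be enough as a workaround (a sketch, untested on 1.4.1; the sample values are taken from the data earlier in this thread):

import numpy as np

# hypothetical 1.4.x result: an object array holding Python complex values
b_obj = np.array([[(-3.9700-5.0400j), (-1.1318-2.5693j)],
                  [(-5.4797+0.0000j), ( 1.8585-1.5502j)]], dtype=object)

b = b_obj.astype(np.complex64)   # element-wise cast back to a proper complex ndarray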
-=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 17 17:07:07 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 17 Jun 2011 16:07:07 -0500 Subject: [Numpy-discussion] code review/build & test for datetime business day API In-Reply-To: <22E40E5D-E0EF-41B5-8FDD-B08B64BBC553@astro.physik.uni-goettingen.de> References: <069EAC6A-E1B7-4126-BC63-A6945F5BC6CE@astro.physik.uni-goettingen.de> <7B7EBC83-E5C7-4E76-842E-EB4431726214@astro.physik.uni-goettingen.de> <91FE4549-74FE-421A-A1B7-702B63C500E9@astro.physik.uni-goettingen.de> <22E40E5D-E0EF-41B5-8FDD-B08B64BBC553@astro.physik.uni-goettingen.de> Message-ID: On Fri, Jun 17, 2011 at 2:46 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > On 17.06.2011, at 8:05PM, Mark Wiebe wrote: > > > On Thu, Jun 16, 2011 at 8:18 PM, Derek Homeier < > derek at astro.physik.uni-goettingen.de> wrote: > >> On 17.06.2011, at 2:02AM, Mark Wiebe wrote: > >> > >> >> ok, that was a lengthy hunt, but it's in printing the string in > make_iso_8601_date: > >> >> > >> >> tmplen = snprintf(substr, sublen, "%04" NPY_INT64_FMT, dts->year); > >> >> fprintf(stderr, "printed %d[%d]: dts->year=%lld: %s\n", tmplen, > sublen, dts->year, substr); > >> >> > >> >> produces > >> >> > >> >> >>> np.datetime64('1970-03-23 20:00:00Z', 'D') > >> >> printed 4[62]: dts->year=1970: 0000 > >> >> numpy.datetime64('0000-03-23','D') > >> >> > >> >> It seems snprintf is not using the correct format for INT64 (as I > happened to do in fprintf before > >> >> realising I had to use "%lld" ;-) - could it be this is a general > issue, which just does not show up > >> >> on little-endian machines because they happen to pass the right half > of the int64 to printf? > >> >> BTW, how is this supposed to be handled (in 4 digits) if the year is > indeed beyond the 32bit range > >> >> (i.e. >~ 0.3 Hubble times...)? Just wondering if one could simply > cast it to int32 before print. > >> >> > >> > I'd prefer to fix the NPY_INT64_FMT macro. There's no point in having > it if it doesn't work... What is NumPy setting it to for that platform? > >> > > >> Of course (just felt somewhat lost among all the #defines). It clearly > seems to be mis-constructed > >> on PowerPC 32: > >> NPY_SIZEOF_LONG is 4, thus NPY_INT64_FMT is set to NPY_LONGLONG_FMT - > "Ld", > >> but this does not seem to handle int64 on big-endian Macs - explicitly > printing "%Ld", dts->year > >> also produces 0. > >> Changing the snprintf format to "%04" "lld" produces the correct output, > so if nothing else > >> avails, I suggest to put something like > >> > >> # elseif (defined(__ppc__) || defined(__ppc64__)) > >> #define LONGLONG_FMT "lld" > >> #define ULONGLONG_FMT "llu" > >> # else > >> > >> into npy_common.h (or possibly simply "defined(__APPLE__)", since %lld > seems to > >> work on 32bit i386 Macs just as well). > >> > > Probably a minimally invasive change is best, also this kind of thing > deserves a comment explaining the problem that was encountered with the > specific platforms, so that in the future when people examine this part they > can understand why this is there. Do you want to make a pull request for > this change? > > > I'd go with the defined(__APPLE__) then, since %Ld produces wrong results > on both 32bit platforms. More precisely, this print > "%Ld - %Ld", dts->year, dts->year > produces "0 - 1970" on ppc and "1970 - 0" on i386, while "%lld - %lld" > prints "1970 - 1970" on both archs. 
There still is an issue (I now remember > this came up with a different test a few months ago), that none of the > formats seems to be able to actually print numbers > 2**32 (or 2**31, don't > remember), but this seemed out of reach for anyone on this list. > That proves that it's wrong on i386 as well, and it makes perfect sense for big endian and little endian architectures. -Mark > > Cheers, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 17 17:07:47 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 17 Jun 2011 16:07:47 -0500 Subject: [Numpy-discussion] crash in multiarray In-Reply-To: References: Message-ID: On Fri, Jun 17, 2011 at 3:48 PM, Pauli Virtanen wrote: > On Fri, 17 Jun 2011 15:39:21 -0500, Robert Kern wrote: > [clip] > >> File > >> > "/home/bvr/Programs/scipy/scipy/sparse/linalg/eigen/arpack/tests/test_arpack.py", > >> line 378 in test_ticket_1459_arpack_crash > > > > Well, if it's testing that it crashes, test passed! > > > > Here is the referenced ticket and the relevant commits: > > > > http://projects.scipy.org/scipy/ticket/1459 > > rm -rf build > > and try again. distutils does not rebuild the module even when some > files have changed. > > Pauli > > Ah, that did the trick. Also, just to be clear for anybody reading this thread, it is the scipy's build directory that has to get cleaned out. Thanks, Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Fri Jun 17 17:24:08 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Fri, 17 Jun 2011 23:24:08 +0200 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> Message-ID: <41FFCE4C-1D43-485B-B984-1B28C2C8A029@astro.physik.uni-goettingen.de> On 17.06.2011, at 11:01PM, Olivier Delalleau wrote: >> You were just overdoing it by already creating an array with the converter, this apparently caused genfromtxt to create a structured array from the input (which could be converted back to an ndarray, but that can prove tricky as well) - similar, if you omit the dtype=None. The following >> >> cnv = dict.fromkeys(range(4), lambda x: complex(*eval(x))) >> b = np.genfromtxt(a,converters=cnv, dtype=None, delimiter=18, usecols=range(4)) >> >> directly produces a shape(4,4) complex array for me (you may have to apply an .astype(np.complex64) afterwards if so desired). >> >> BTW I think this is an interesting enough case of reading non-trivially structured data that it deserves to appear on some examples or cookbook page. >> >> HTH, >> Derek >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > I had tried that as well and it doesn't work with numpy 1.4.1 (I get an object array). It may have been fixed in a later version. OK, I was using the current master from github, but it works in 1.6.0 as well. I still noticed some differences between loadtxt and genfromtxt behaviour, e.g. where loadtxt would be able to take a string from the converter and automatically convert it to a number, whereas in genfromtxt the converter still had to include the float() or complex()... 
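For example, roughly (a sketch against numpy 1.6; the decimal commas are only a stand-in for any converter that returns a cleaned-up string rather than a number):

import numpy as np
from StringIO import StringIO

data = "1,5 2,0\n3,5 4,0\n"            # decimal commas, purely illustrative
fix = lambda s: s.replace(',', '.')    # note: returns a *string*

# loadtxt converts the string returned by the converter to float itself:
a = np.loadtxt(StringIO(data), converters={0: fix, 1: fix})

# genfromtxt wants the converter to deliver the final value:
b = np.genfromtxt(StringIO(data),
                  converters={0: lambda s: float(fix(s)),
                              1: lambda s: float(fix(s))})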
Derek From gruben at bigpond.net.au Fri Jun 17 22:48:52 2011 From: gruben at bigpond.net.au (gary ruben) Date: Sat, 18 Jun 2011 12:48:52 +1000 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: <41FFCE4C-1D43-485B-B984-1B28C2C8A029@astro.physik.uni-goettingen.de> References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> <41FFCE4C-1D43-485B-B984-1B28C2C8A029@astro.physik.uni-goettingen.de> Message-ID: Thanks guys - I'm happy with the solution for now. FYI, Derek's suggestion doesn't work in numpy 1.5.1 either. For any developers following this thread, I think this might be a nice use case for genfromtxt to handle in future. As a corollary of this problem, I wonder whether there's a human-readable text format for complex numbers that genfromtxt can currently easily parse into a complex array? Having the hard-coded value for the number of columns in the converter and the genfromtxt call goes against the philosophy of the function's ability to form an array of shape matching the input layout. Gary On Sat, Jun 18, 2011 at 7:24 AM, Derek Homeier wrote: > On 17.06.2011, at 11:01PM, Olivier Delalleau wrote: > >>> You were just overdoing it by already creating an array with the converter, this apparently caused genfromtxt to create a structured array from the input (which could be converted back to an ndarray, but that can prove tricky as well) - similar, if you omit the dtype=None. The following >>> >>> cnv = dict.fromkeys(range(4), lambda x: complex(*eval(x))) >>> b = np.genfromtxt(a,converters=cnv, dtype=None, delimiter=18, usecols=range(4)) >>> >>> directly produces a shape(4,4) complex array for me (you may have to apply an .astype(np.complex64) afterwards if so desired). >>> >>> BTW I think this is an interesting enough case of reading non-trivially structured data that it deserves to appear on some examples or cookbook page. >>> >>> HTH, >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Derek >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >> I had tried that as well and it doesn't work with numpy 1.4.1 (I get an object array). It may have been fixed in a later version. > > OK, I was using the current master from github, but it works in 1.6.0 as well. I still noticed some differences between loadtxt and genfromtxt behaviour, e.g. where loadtxt would be able to take a string from the converter and automatically convert it to a number, whereas in genfromtxt the converter still had to include the float() or complex()... > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From shish at keba.be Sat Jun 18 01:44:45 2011 From: shish at keba.be (Olivier Delalleau) Date: Sat, 18 Jun 2011 01:44:45 -0400 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> <41FFCE4C-1D43-485B-B984-1B28C2C8A029@astro.physik.uni-goettingen.de> Message-ID: For the hardcoded part, you can easily read the first line of your file and split it with the same delimiter to know the number of columns. It's sure it'd be best to be able to be able to skip this part, but you don't need to hardcode this number into your code at least. 
Something like: n_cols = len(open('myfile.txt').readline().split(delimiter)) -=- Olivier 2011/6/17 gary ruben > Thanks guys - I'm happy with the solution for now. FYI, Derek's > suggestion doesn't work in numpy 1.5.1 either. > For any developers following this thread, I think this might be a nice > use case for genfromtxt to handle in future. > As a corollary of this problem, I wonder whether there's a > human-readable text format for complex numbers that genfromtxt can > currently easily parse into a complex array? Having the hard-coded > value for the number of columns in the converter and the genfromtxt > call goes against the philosophy of the function's ability to form an > array of shape matching the input layout. > > Gary > > On Sat, Jun 18, 2011 at 7:24 AM, Derek Homeier > wrote: > > On 17.06.2011, at 11:01PM, Olivier Delalleau wrote: > > > >>> You were just overdoing it by already creating an array with the > converter, this apparently caused genfromtxt to create a structured array > from the input (which could be converted back to an ndarray, but that can > prove tricky as well) - similar, if you omit the dtype=None. The following > >>> > >>> cnv = dict.fromkeys(range(4), lambda x: complex(*eval(x))) > >>> b = np.genfromtxt(a,converters=cnv, dtype=None, delimiter=18, > usecols=range(4)) > >>> > >>> directly produces a shape(4,4) complex array for me (you may have to > apply an .astype(np.complex64) afterwards if so desired). > >>> > >>> BTW I think this is an interesting enough case of reading non-trivially > structured data that it deserves to appear on some examples or cookbook > page. > >>> > >>> HTH, > >>> Derek > >>> > >>> _______________________________________________ > >>> NumPy-Discussion mailing list > >>> NumPy-Discussion at scipy.org > >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >>> > >> I had tried that as well and it doesn't work with numpy 1.4.1 (I get an > object array). It may have been fixed in a later version. > > > > OK, I was using the current master from github, but it works in 1.6.0 as > well. I still noticed some differences between loadtxt and genfromtxt > behaviour, e.g. where loadtxt would be able to take a string from the > converter and automatically convert it to a number, whereas in genfromtxt > the converter still had to include the float() or complex()... > > > > Derek > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Sat Jun 18 12:22:29 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Sat, 18 Jun 2011 18:22:29 +0200 Subject: [Numpy-discussion] genfromtxt converter question In-Reply-To: References: <4DFB587C.60603@gmail.com> <4DFB6549.2090303@gmail.com> <41FFCE4C-1D43-485B-B984-1B28C2C8A029@astro.physik.uni-goettingen.de> Message-ID: <849C2DF0-C1DD-452F-A8F9-949FB9A430B0@astro.physik.uni-goettingen.de> On 18 Jun 2011, at 04:48, gary ruben wrote: > Thanks guys - I'm happy with the solution for now. FYI, Derek's > suggestion doesn't work in numpy 1.5.1 either. > For any developers following this thread, I think this might be a nice > use case for genfromtxt to handle in future. 
Numpy 1.6.0 and above is handling it in the present. > As a corollary of this problem, I wonder whether there's a > human-readable text format for complex numbers that genfromtxt can > currently easily parse into a complex array? Having the hard-coded > value for the number of columns in the converter and the genfromtxt > call goes against the philosophy of the function's ability to form an > array of shape matching the input layout. genfromtxt and (for regular data such as yours) loadtxt can also parse the standard '1.0+3.14j ...' format practically automatically as long as you specify 'dtype=complex' (and appropriate delimiter, if required). loadtxt does not handle dtype=np.complex64 or np.complex256 at this time due to a bug in the default converters; I have just created a patch for that. In principle genfromtxt and loadtxt in numpy 1.6 can also handle cases similar to your input, but you need to change the tuples at least to contain no spaces - '( -1.4249, 1.7330)' -> '(-1.4249,+1.7330)' or additionally insert another delimiter like ';' - otherwise it's hopeless to correctly infer the number of columns. With such input, np.genfromtxt(a, converters=cnv, dtype=complex) works as expected, and I have also just created a patch that would allow users to more easily specify a default converter for all input data rather than constructing one for every row. A wider-reaching automatic detection of complex formats might be feasible for genfromtxt, but I'd suspect it could create quite some overhead, as I can think of at least two formats that should probably be considered - tuples as above, and pairs of 'real,imag' without parentheses. But feel free to create a ticket if you think this would be an important enhancement. Cheers, Derek From derek at astro.physik.uni-goettingen.de Sat Jun 18 12:56:59 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Sat, 18 Jun 2011 18:56:59 +0200 Subject: [Numpy-discussion] Enhancements and bug fix for npyio Message-ID: <2E280CE7-0DDB-40B0-A849-61EDAB166874@astro.physik.uni-goettingen.de> Hi, in my experience numbers loaded from text files very often require identical conversion for all data (e.g. parsing Fortran-style double precision, or German vs. English decimals...). Yet loadtxt and genfromtxt in such cases require the user to construct a dictionary of converters for every single data column, a task that could be easily handled within the routines. I am therefore putting up a patch for testing that allows users to specify a default converter for all input data as "converters={-1: conv}" (or alternatively {'all': conv}, leaving that open to discussion). Please test the changeset (still lacking tests for the new option) at https://github.com/dhomeier/numpy/compare/master...iocnv-wildcard which also contains some bug fixes to handle complex input. Awaiting your comments, Derek From charlesr.harris at gmail.com Sat Jun 18 16:14:19 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 18 Jun 2011 14:14:19 -0600 Subject: [Numpy-discussion] Polynomial documentation pull request, attn: Ralph Message-ID: Hi Ralph, Pull request #91 adds minimal documentation for the polynomial module. I'll leave backporting it to 1.6.1 up to your discretion. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at googlemail.com Sun Jun 19 06:21:44 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 19 Jun 2011 12:21:44 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: On Tue, Jun 14, 2011 at 5:28 AM, Bruce Southey wrote: > On Mon, Jun 13, 2011 at 8:31 PM, Pauli Virtanen wrote: > > On Mon, 13 Jun 2011 11:08:18 -0500, Bruce Southey wrote: > > [clip] > >> OSError: > >> /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: cannot > >> open shared object file: No such file or directory > > > > I think that's a result of Python 3.2 changing the extension module > > file naming scheme (IIRC to a versioned one). > > > > The '.pyd' ending and its Unix counterpart are IIRC hardcoded to the > > failing test, or some support code in Numpy. Clearly, they should instead > > ask Python how it names the extensions modules. This information may be > > available somewhere in the `sys` module. > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > Pauli, > I am impressed yet again! > It really saved a lot of time to understand the reason. > > A quick comparison between Python versions: > Python2.7: multiarray.so > Python3.2: multiarray.cpython-32m.so > > Search for the extension leads to PEP 3149 > http://www.python.org/dev/peps/pep-3149/ > That looked familiar: http://projects.scipy.org/numpy/ticket/1749 https://github.com/numpy/numpy/commit/ee0831a8 > > This is POSIX only which excludes windows. I tracked it down to the > list 'libname_ext' defined in file "numpy/ctypeslib.py" line 100: > libname_ext = ['%s.so' % libname, '%s.pyd' % libname] > > On my 64-bit Linux system, I get the following results: > $ python2.4 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" > .so > $ python2.5 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" > .so > $ python2.7 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" > .so > $ python3.1 -c "from distutils import sysconfig; > print(sysconfig.get_config_var('SO'))" > .so > $ python3.2 -c "from distutils import sysconfig; > print(sysconfig.get_config_var('SO'))" > .cpython-32m.so > > My Windows 32-bit Python2.6 install: > >>> from distutils import sysconfig > >>> sysconfig.get_config_var('SO') > '.pyd' > > Making the bug assumption that other OSes including 32-bit and 64-bit > versions also provide the correct suffix, this suggest adding the > import and modification as: > > import from distutils import sysconfig > libname_ext = ['%s%s' % (libname, sysconfig.get_config_var('SO'))] > > Actually if that works for Mac, Sun etc. then 'load_library' function > could be smaller. > The issue is that libraries built as Python extensions have the extra ".cpython-mu32", but other libraries on your system don't. And sysconfig.get_config_var('SO') is used for both. I don't see another config var that allows to distinguish between the two. Unless on platforms where it matters there are separate return values in sysconfig.get_config_vars('SO'). What do you get for that on Linux? The logic of figuring out the right extension string should not be duplicated, distutils.misc_util seems like a good place for it. 
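Something along these lines might do, i.e. try the full extension-module suffix first and fall back to the bare platform suffixes (only a sketch of the idea, not the code in the branch):

from distutils import sysconfig

def _candidate_names(libname):
    # PEP 3149 extension-module suffix, e.g. '.cpython-32m.so' on Python 3.2,
    # plain '.so' / '.pyd' on older versions
    so_ext = sysconfig.get_config_var('SO') or ''
    names = ['%s%s' % (libname, so_ext)]
    # shared libraries that are not Python extensions never carry the versioned tag
    for plain in ('.so', '.pyd'):
        if plain != so_ext:
            names.append('%s%s' % (libname, plain))
    return names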
Can you test if this works for you: https://github.com/rgommers/numpy/tree/sharedlib-ext > Finally the test_basic2 in "tests/test_ctypeslib.py" needs to be > changed as well because it is no longer correct. Just commenting out > the 'fix' appears to work. > > It works in the case of trying to load multiarray, but the function under test would still be broken. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Sun Jun 19 06:28:58 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Sun, 19 Jun 2011 12:28:58 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <4A3EECE0-4BB5-4073-A08C-76BE3546082F@astro.physik.uni-goettingen.de> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4A3EECE0-4BB5-4073-A08C-76BE3546082F@astro.physik.uni-goettingen.de> Message-ID: On Tue, Jun 14, 2011 at 2:53 AM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > On 13.06.2011, at 6:19PM, Ralf Gommers wrote: > > > > >> I used wget using the direct link and it eventually got the complete > >> file after multiple tries. > >> > > Yes, SF is having a pretty bad day. > > > I eventually also got the tarball through the fink/debian mirror list; no > failures on OS X 10.5 i386. There are two different failures with Python3.x > on ppc, > gfortran 4.2.4, that were already present in 1.6.0 and still are on master > however: > > >>> numpy.test('full') > Running unit tests for numpy > NumPy version 1.6.1rc1 > NumPy is installed in /sw/lib/python3.2/site-packages/numpy > Python version 3.2 (r32:88445, Mar 1 2011, 18:28:16) [GCC 4.0.1 (Apple > Inc. build 5493)] > nose version 1.0.0 > ... > ====================================================================== > FAIL: test_return_character.TestF77ReturnCharacter.test_all > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/sw/lib/python3.2/site-packages/nose/case.py", line 188, in runTest > self.test(*self.arg) > File > "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", > line 78, in test_all > self.check_function(getattr(self.module, name)) > File > "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", > line 12, in check_function > r = t(array('ab'));assert_( r==asbytes('a'),repr(r)) > File "/sw/lib/python3.2/site-packages/numpy/testing/utils.py", line 34, in > assert_ > raise AssertionError(msg) > AssertionError: b' ' > > ====================================================================== > FAIL: test_return_character.TestF90ReturnCharacter.test_all > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/sw/lib/python3.2/site-packages/nose/case.py", line 188, in runTest > self.test(*self.arg) > File > "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", > line 136, in test_all > self.check_function(getattr(self.module.f90_return_char, name)) > File > "/sw/lib/python3.2/site-packages/numpy/f2py/tests/test_return_character.py", > line 12, in check_function > r = t(array('ab'));assert_( r==asbytes('a'),repr(r)) > File "/sw/lib/python3.2/site-packages/numpy/testing/utils.py", line 34, in > assert_ > raise AssertionError(msg) > AssertionError: b' ' > > (both errors are raised on the first function tested, t0). 
> > Could you open a ticket for these? If it's f2py + py3.2 + ppc only I'd like to ignore them for 1.6.1. Thanks, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Sun Jun 19 13:11:13 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Sun, 19 Jun 2011 19:11:13 +0200 Subject: [Numpy-discussion] loadtxt ndmin option In-Reply-To: References: <0D212BB0-437C-4312-9CE1-BF982F3BAC40@astro.physik.uni-goettingen.de> <31740152.post@talk.nabble.com> <6224395B-AE11-424B-98EA-F9C88F50AE18@gmail.com> <4DE50A36.4020402@gmail.com> <1B6FBF74-D6E9-4A8E-9641-B7FEE54C814E@astro.physik.uni-goettingen.de> <1FAE9453-604F-4E83-B79F-307D5B4C5CF7@gmail.com> <0C7E9797-A7FD-4CA2-BBA9-A0613D023D76@astro.physik.uni-goettingen.de> Message-ID: On 31 May 2011, at 21:28, Pierre GM wrote: > On May 31, 2011, at 6:37 PM, Derek Homeier wrote: > >> On 31 May 2011, at 18:25, Pierre GM wrote: >> >>> On May 31, 2011, at 5:52 PM, Derek Homeier wrote: >>>> >>>> I think stuff like multiple delimiters should have been dealt with >>>> before, as the right place to insert the ndmin code (which includes >>>> the decision to squeeze or not to squeeze as well as to add >>>> additional >>>> dimensions, if required) would be right at the end before the >>>> 'unpack' >>>> switch, or rather replacing the bit: >>>> >>>> if usemask: >>>> output = output.view(MaskedArray) >>>> output._mask = outputmask >>>> if unpack: >>>> return output.squeeze().T >>>> return output.squeeze() >>>> >>>> But there it's already not clear to me how to deal with the >>>> MaskedArray case... >>> >>> Oh, easy. >>> You need to replace only the last three lines of genfromtxt with the >>> ones from loadtxt (808-833). Then, if usemask is True, you need to >>> use ma.atleast_Xd instead of np.atleast_Xd. Et voil?. >>> Comments: >>> * I would raise an exception if ndmin isn't correct *before* trying >>> to read the file... >>> * You could define a `collapse_function` that would be >>> `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on >>> the values of `usemask` and `ndmin`... Thanks, that helped to clean up a little bit. >>> If you have any question about numpy.ma, don't hesitate to contact >>> me directly. >> >> Thanks for the directions! I was not sure about the usemask case >> because it presently does not invoke .squeeze() either... > > The idea is that if `usemask` is True, you build a second array (the > mask), that you attach to your main array at the very end (in the > `output=output.view(MaskedArray), output._mask = mask` combo...). > Afterwards, it's a regular MaskedArray that supports the .squeeze() > method... > OK, in both cases output.squeeze() is now used if ndim>ndmin and usemask is False - at least it does not break any tests, so it seems to work with MaskedArrays as well. > >> On a >> possibly related note, genfromtxt also treats the 'unpack'ing of >> structured arrays differently from loadtxt (which returns a list of >> arrays in that case) - do you know if this is on purpose, or also >> rather missing functionality (I guess it might break >> recfromtxt()...)? > > Keep in mind that I haven't touched genfromtxt since 8-10 months or > so. I wouldn't be surprised that it were lagging a bit behind > loadtxt in terms of development. Yes, there'll be some tweaking to > do for recfromtxt (it's OK for now if `ndmin` and `unpack` are the > defaults) and others, but nothing major. 
Well, at long last I got to implement the above and added the corresponding tests for genfromtxt - with the exception of the dimension-0 cases, since genfromtxt raises an error on empty files. There already is a comment it should perhaps better return an empty array, so I am putting that idea up for discussion here again. I tried to devise a very basic test with masked arrays, just added to test_withmissing now. I also implemented the same unpacking behaviour for structured arrays and just made recfromtxt set unpack=False to work (or should it issue a warning?). The patches are up for review as commit 8ac01636 in my iocnv-wildcard branch: https://github.com/dhomeier/numpy/compare/master...iocnv-wildcard Cheers, Derek From derek at astro.physik.uni-goettingen.de Sun Jun 19 13:34:30 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Sun, 19 Jun 2011 19:34:30 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4A3EECE0-4BB5-4073-A08C-76BE3546082F@astro.physik.uni-goettingen.de> Message-ID: <9B3AD746-E0BC-44CF-8E0C-25014655FDCF@astro.physik.uni-goettingen.de> Hi Ralf, On 19 Jun 2011, at 12:28, Ralf Gommers wrote: > >>> numpy.test('full') > Running unit tests for numpy > NumPy version 1.6.1rc1 > NumPy is installed in /sw/lib/python3.2/site-packages/numpy > Python version 3.2 (r32:88445, Mar 1 2011, 18:28:16) [GCC 4.0.1 > (Apple Inc. build 5493)] > nose version 1.0.0 > ... > ====================================================================== > FAIL: test_return_character.TestF77ReturnCharacter.test_all > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/sw/lib/python3.2/site-packages/nose/case.py", line 188, in > runTest > self.test(*self.arg) > File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/ > test_return_character.py", line 78, in test_all > self.check_function(getattr(self.module, name)) > File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/ > test_return_character.py", line 12, in check_function > r = t(array('ab'));assert_( r==asbytes('a'),repr(r)) > File "/sw/lib/python3.2/site-packages/numpy/testing/utils.py", line > 34, in assert_ > raise AssertionError(msg) > AssertionError: b' ' > > ====================================================================== > FAIL: test_return_character.TestF90ReturnCharacter.test_all > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/sw/lib/python3.2/site-packages/nose/case.py", line 188, in > runTest > self.test(*self.arg) > File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/ > test_return_character.py", line 136, in test_all > self.check_function(getattr(self.module.f90_return_char, name)) > File "/sw/lib/python3.2/site-packages/numpy/f2py/tests/ > test_return_character.py", line 12, in check_function > r = t(array('ab'));assert_( r==asbytes('a'),repr(r)) > File "/sw/lib/python3.2/site-packages/numpy/testing/utils.py", line > 34, in assert_ > raise AssertionError(msg) > AssertionError: b' ' > > (both errors are raised on the first function tested, t0). > > Could you open a ticket for these? If it's f2py + py3.2 + ppc only > I'd like to ignore them for 1.6.1. and 3.1, but I agree. 
http://projects.scipy.org/numpy/ticket/1871 Cheers, Derek From mwwiebe at gmail.com Sun Jun 19 18:48:27 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sun, 19 Jun 2011 17:48:27 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: Message-ID: On Mon, Jun 13, 2011 at 7:58 AM, Ralf Gommers wrote: > Hi, > > I am pleased to announce the availability of the first release candidate of > NumPy 1.6.1. This is a bugfix release, list of fixed bugs: > #1834 einsum fails for specific shapes > #1837 einsum throws nan or freezes python for specific array shapes > #1838 object <-> structured type arrays regression > #1851 regression for SWIG based code in 1.6.0 > #1863 Buggy results when operating on array copied with astype() > > If no problems are reported, the final release will be in one week. Sources > and binaries can be found at > https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ I've committed a fix to #1870, a regression in 1.6.0, in master. I'll leave you with the judgement call on what to do with it. Cheers, Mark > > Enjoy, > Ralf > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cgohlke at uci.edu Mon Jun 20 03:19:49 2011 From: cgohlke at uci.edu (Christoph Gohlke) Date: Mon, 20 Jun 2011 00:19:49 -0700 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: Message-ID: <4DFEF495.6040400@uci.edu> On 6/13/2011 5:58 AM, Ralf Gommers wrote: > Hi, > > I am pleased to announce the availability of the first release candidate > of NumPy 1.6.1. This is a bugfix release, list of fixed bugs: > #1834 einsum fails for specific shapes > #1837 einsum throws nan or freezes python for specific array shapes > #1838 object <-> structured type arrays regression > #1851 regression for SWIG based code in 1.6.0 > #1863 Buggy results when operating on array copied with astype() > > If no problems are reported, the final release will be in one week. > Sources and binaries can be found > at > https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ > > Enjoy, > Ralf > Hi, in case there is going to be another release candidate: could a fix for ticket #1843 (numpy.rec.array fails with open file on Py3) be included in numpy 1.6.1? It's not crucial though. http://projects.scipy.org/numpy/ticket/1843 Christoph From bsouthey at gmail.com Mon Jun 20 10:20:16 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 20 Jun 2011 09:20:16 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: <4DFF5720.6020501@gmail.com> On 06/19/2011 05:21 AM, Ralf Gommers wrote: > > > On Tue, Jun 14, 2011 at 5:28 AM, Bruce Southey > wrote: > > On Mon, Jun 13, 2011 at 8:31 PM, Pauli Virtanen > wrote: > > On Mon, 13 Jun 2011 11:08:18 -0500, Bruce Southey wrote: > > [clip] > >> OSError: > >> > /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: > cannot > >> open shared object file: No such file or directory > > > > I think that's a result of Python 3.2 changing the extension module > > file naming scheme (IIRC to a versioned one). 
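(As an aside, the suffix Python itself uses for extension modules can be queried directly; these are the same sysconfig values checked further down in this thread:)

from distutils import sysconfig

# Python 3.2 (PEP 3149) includes the ABI tag, e.g. '.cpython-32m.so';
# Python 2.x gives a plain '.so' (or '.pyd' on Windows).
print(sysconfig.get_config_var('SO'))

# The bare ABI tag, e.g. 'cpython-32m'; not defined on older versions.
print(sysconfig.get_config_var('SOABI'))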
> > > > The '.pyd' ending and its Unix counterpart are IIRC hardcoded to the > > failing test, or some support code in Numpy. Clearly, they > should instead > > ask Python how it names the extensions modules. This information > may be > > available somewhere in the `sys` module. > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > Pauli, > I am impressed yet again! > It really saved a lot of time to understand the reason. > > A quick comparison between Python versions: > Python2.7: multiarray.so > Python3.2: multiarray.cpython-32m.so > > > Search for the extension leads to PEP 3149 > http://www.python.org/dev/peps/pep-3149/ > > > That looked familiar: > http://projects.scipy.org/numpy/ticket/1749 > https://github.com/numpy/numpy/commit/ee0831a8 > > > This is POSIX only which excludes windows. I tracked it down to the > list 'libname_ext' defined in file "numpy/ctypeslib.py" line 100: > libname_ext = ['%s.so' % libname, '%s.pyd' % libname] > > On my 64-bit Linux system, I get the following results: > $ python2.4 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" > .so > $ python2.5 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" > .so > $ python2.7 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" > .so > $ python3.1 -c "from distutils import sysconfig; > print(sysconfig.get_config_var('SO'))" > .so > $ python3.2 -c "from distutils import sysconfig; > print(sysconfig.get_config_var('SO'))" > .cpython-32m.so > > My Windows 32-bit Python2.6 install: > >>> from distutils import sysconfig > >>> sysconfig.get_config_var('SO') > '.pyd' > > Making the bug assumption that other OSes including 32-bit and 64-bit > versions also provide the correct suffix, this suggest adding the > import and modification as: > > import from distutils import sysconfig > libname_ext = ['%s%s' % (libname, sysconfig.get_config_var('SO'))] > > Actually if that works for Mac, Sun etc. then 'load_library' function > could be smaller. > > > The issue is that libraries built as Python extensions have the extra > ".cpython-mu32", but other libraries on your system don't. And > sysconfig.get_config_var('SO') is used for both. I don't see another > config var that allows to distinguish between the two. Unless on > platforms where it matters there are separate return values in > sysconfig.get_config_vars('SO'). What do you get for that on Linux? > > The logic of figuring out the right extension string should not be > duplicated, distutils.misc_util seems like a good place for it. Can > you test if this works for you: > https://github.com/rgommers/numpy/tree/sharedlib-ext > > > Finally the test_basic2 in "tests/test_ctypeslib.py" needs to be > changed as well because it is no longer correct. Just commenting out > the 'fix' appears to work. > > It works in the case of trying to load multiarray, but the function > under test would still be broken. > > Ralf > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion Okay, So how do I get these changes and apply them to the release candidate? (This part of the development/test workflow needs some attention.) Bruce -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Mon Jun 20 14:32:48 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 20 Jun 2011 13:32:48 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs Message-ID: NumPy has a mechanism built in to allow subclasses to adjust or override aspects of the ufunc behavior. While this goal is important, this mechanism only allows for very limited customization, making for instance the masked arrays unable to work with the native ufuncs in a full and proper way. I would like to deprecate the current mechanism, in particular __array_prepare__ and __array_wrap__, and introduce a new method I will describe below. If you've ever used these mechanisms, please review this design to see if it meets your needs. Any class type which would like to override its behavior in ufuncs would define a method called _numpy_ufunc_, and optionally an attribute __array_priority__ as can already be done. The class which wins the priority battle gets its _numpy_ufunc_ function called as follows: return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) To support this overloading, the ufunc would get a new support method, result_type, and there would be a new global function, broadcast_empty_like. The method ufunc.result_type behaves like the global np.result_type, but produces the output type or a tuple of output types specific to the ufunc, which may follow a different convention than regular arithmetic type promotion. This allows for a class to create an output array of the correct type to pass to the ufunc if it needs to be different than the default. The function broadcast_empty_like is just like empty_like, but takes a list or tuple of arrays which are to be broadcast together for producing the output, instead of just one. Thanks, Mark A simple class which overrides the ufuncs might look as follows:

def sin(ufunc, *args, **kwargs):
    # Convert degrees to radians (args is a tuple, so rebuild it)
    args = (np.deg2rad(args[0]),) + args[1:]
    # Return a regular array, since the result is not in degrees
    return ufunc(*args, **kwargs)

class MyDegreesClass:
    """Array-like object with a degrees unit"""

    def __init__(self, arr):
        self.arr = arr

    def _numpy_ufunc_(self, ufunc, *args, **kwargs):
        override = globals().get(ufunc.name)
        if override:
            return override(ufunc, *args, **kwargs)
        else:
            raise TypeError, 'ufunc %s incompatible with MyDegreesClass' % ufunc.name

A more complex example will be something like this:

def my_general_ufunc(ufunc, *args, **kwargs):
    # Extract the 'out' argument. This only supports ufuncs with
    # one output, currently.
    out = kwargs.get('out')
    if len(args) > ufunc.nin:
        if out is None:
            out = args[ufunc.nin]
        else:
            raise ValueError, "'out' given as both a position and keyword argument"
    # Just want the inputs from here on
    args = args[:ufunc.nin]

    # Strip out MyArrayClass, but allow operations with regular ndarrays
    raw_in = []
    for a in args:
        if isinstance(a, MyArrayClass):
            raw_in.append(a.arr)
        else:
            raw_in.append(a)

    # Allocate the output array
    if not out is None:
        if not isinstance(out, MyArrayClass):
            raise TypeError, "'out' must have type MyArrayClass"
    else:
        # Create the output array, obeying the 'order' parameter,
        # but disallowing subclasses
        out = MyArrayClass(np.broadcast_empty_like(args,
                               order=kwargs.get('order'),
                               dtype=ufunc.result_type(args),
                               subok=False))

    # Override the output argument
    kwargs['out'] = out.arr

    # Call the ufunc on the stripped-out input arrays
    ufunc(*raw_in, **kwargs)

    # Return the output
    return out

class MyArrayClass:
    def __init__(self, arr):
        self.arr = arr

    def _numpy_ufunc_(self, ufunc, *args, **kwargs):
        override = globals().get(ufunc.name)
        if override:
            return override(ufunc, *args, **kwargs)
        else:
            return my_general_ufunc(ufunc, *args, **kwargs)

-------------- next part -------------- An HTML attachment was scrubbed... URL: From bsouthey at gmail.com Mon Jun 20 14:50:35 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 20 Jun 2011 13:50:35 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> Message-ID: <4DFF967B.7090407@gmail.com> On 06/19/2011 05:21 AM, Ralf Gommers wrote: > > > On Tue, Jun 14, 2011 at 5:28 AM, Bruce Southey > wrote: > > On Mon, Jun 13, 2011 at 8:31 PM, Pauli Virtanen > wrote: > > On Mon, 13 Jun 2011 11:08:18 -0500, Bruce Southey wrote: [clip] > >> OSError: > >> /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: > >> cannot open shared object file: No such file or directory > > > > I think that's a result of Python 3.2 changing the extension > > module file naming scheme (IIRC to a versioned one). > > > > The '.pyd' ending and its Unix counterpart are IIRC hardcoded to > > the failing test, or some support code in Numpy. Clearly, they > > should instead ask Python how it names the extensions modules. This > > information may be available somewhere in the `sys` module. > > > > _______________________________________________ NumPy-Discussion > > mailing list NumPy-Discussion at scipy.org > > > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > Pauli, I am impressed yet again! It really saved a lot of time to > understand the reason. > > A quick comparison between Python versions: Python2.7: multiarray.so > Python3.2: multiarray.cpython-32m.so > > > Search for the extension leads to PEP 3149 > http://www.python.org/dev/peps/pep-3149/ > > > That looked familiar: http://projects.scipy.org/numpy/ticket/1749 > https://github.com/numpy/numpy/commit/ee0831a8 > > > This is POSIX only which excludes windows.
I tracked it down to the > list 'libname_ext' defined in file "numpy/ctypeslib.py" line 100: > libname_ext = ['%s.so' % libname, '%s.pyd' % libname] > > On my 64-bit Linux system, I get the following results: $ python2.4 > -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" .so $ python2.5 -c "from distutils > import sysconfig; print sysconfig.get_config_var('SO')" .so $ > python2.7 -c "from distutils import sysconfig; print > sysconfig.get_config_var('SO')" .so $ python3.1 -c "from distutils > import sysconfig; print(sysconfig.get_config_var('SO'))" .so $ > python3.2 -c "from distutils import sysconfig; > print(sysconfig.get_config_var('SO'))" .cpython-32m.so > > My Windows 32-bit Python2.6 install: > >>> from distutils import sysconfig sysconfig.get_config_var('SO') > '.pyd' > > Making the bug assumption that other OSes including 32-bit and > 64-bit versions also provide the correct suffix, this suggest adding > the import and modification as: > > import from distutils import sysconfig libname_ext = ['%s%s' % > (libname, sysconfig.get_config_var('SO'))] > > Actually if that works for Mac, Sun etc. then 'load_library' > function could be smaller. > > > The issue is that libraries built as Python extensions have the extra > ".cpython-mu32", but other libraries on your system don't. And > sysconfig.get_config_var('SO') is used for both. I don't see another > config var that allows to distinguish between the two. Unless on > platforms where it matters there are separate return values in > sysconfig.get_config_vars('SO'). What do you get for that on Linux? > > The logic of figuring out the right extension string should not be > duplicated, distutils.misc_util seems like a good place for it. Can > you test if this works for you: > https://github.com/rgommers/numpy/tree/sharedlib-ext > > > Finally the test_basic2 in "tests/test_ctypeslib.py" needs to be > changed as well because it is no longer correct. Just commenting out > the 'fix' appears to work. > > It works in the case of trying to load multiarray, but the function > under test would still be broken. > > Ralf > > > _______________________________________________ NumPy-Discussion > mailing list NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion I copied the files but that just moves the problem. So that patch is incorrect. I get the same errors on Fedora 15 supplied Python3.2 for numpy 1.6.0 and using git from 'https://github.com/rgommers/numpy.git'. Numpy is getting Fedora supplied Atlas (1.5.1 does not). It appears that there is a misunderstanding of the PEP because 'SO' and 'SOABI' do exactly what the PEP says on my systems: > >> from distutils import sysconfig sysconfig.get_config_var('SO') '.cpython-32m.so' > >> sysconfig.get_config_var('SOABI') 'cpython-32m' Consequently, the name, 'multiarray.pyd', created within numpy is invalid. Looking the code, I see this line which makes no sense given that the second part is true under Linux: if (not is_python_ext) and 'SOABI' in distutils.sysconfig.get_config_vars(): So I think the 'get_shared_lib_extension' function is wrong and probably unneeded. Bruce -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at googlemail.com Mon Jun 20 15:17:47 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 20 Jun 2011 21:17:47 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <4DFF5720.6020501@gmail.com> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF5720.6020501@gmail.com> Message-ID: On Mon, Jun 20, 2011 at 4:20 PM, Bruce Southey wrote: > ** > On 06/19/2011 05:21 AM, Ralf Gommers wrote: > > > > On Tue, Jun 14, 2011 at 5:28 AM, Bruce Southey wrote: > >> On Mon, Jun 13, 2011 at 8:31 PM, Pauli Virtanen wrote: >> > On Mon, 13 Jun 2011 11:08:18 -0500, Bruce Southey wrote: >> > [clip] >> >> OSError: >> >> /usr/local/lib/python3.2/site-packages/numpy/core/multiarray.pyd: >> cannot >> >> open shared object file: No such file or directory >> > >> > I think that's a result of Python 3.2 changing the extension module >> > file naming scheme (IIRC to a versioned one). >> > >> > The '.pyd' ending and its Unix counterpart are IIRC hardcoded to the >> > failing test, or some support code in Numpy. Clearly, they should >> instead >> > ask Python how it names the extensions modules. This information may be >> > available somewhere in the `sys` module. >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> >> Pauli, >> I am impressed yet again! >> It really saved a lot of time to understand the reason. >> >> A quick comparison between Python versions: >> Python2.7: multiarray.so >> Python3.2: multiarray.cpython-32m.so >> >> Search for the extension leads to PEP 3149 >> http://www.python.org/dev/peps/pep-3149/ >> > > That looked familiar: > http://projects.scipy.org/numpy/ticket/1749 > https://github.com/numpy/numpy/commit/ee0831a8 > >> >> This is POSIX only which excludes windows. I tracked it down to the >> list 'libname_ext' defined in file "numpy/ctypeslib.py" line 100: >> libname_ext = ['%s.so' % libname, '%s.pyd' % libname] >> >> On my 64-bit Linux system, I get the following results: >> $ python2.4 -c "from distutils import sysconfig; print >> sysconfig.get_config_var('SO')" >> .so >> $ python2.5 -c "from distutils import sysconfig; print >> sysconfig.get_config_var('SO')" >> .so >> $ python2.7 -c "from distutils import sysconfig; print >> sysconfig.get_config_var('SO')" >> .so >> $ python3.1 -c "from distutils import sysconfig; >> print(sysconfig.get_config_var('SO'))" >> .so >> $ python3.2 -c "from distutils import sysconfig; >> print(sysconfig.get_config_var('SO'))" >> .cpython-32m.so >> >> My Windows 32-bit Python2.6 install: >> >>> from distutils import sysconfig >> >>> sysconfig.get_config_var('SO') >> '.pyd' >> >> Making the bug assumption that other OSes including 32-bit and 64-bit >> versions also provide the correct suffix, this suggest adding the >> import and modification as: >> >> import from distutils import sysconfig >> libname_ext = ['%s%s' % (libname, sysconfig.get_config_var('SO'))] >> >> Actually if that works for Mac, Sun etc. then 'load_library' function >> could be smaller. >> > > The issue is that libraries built as Python extensions have the extra > ".cpython-mu32", but other libraries on your system don't. And > sysconfig.get_config_var('SO') is used for both. I don't see another config > var that allows to distinguish between the two. 
Unless on platforms where it > matters there are separate return values in sysconfig.get_config_vars('SO'). > What do you get for that on Linux? > > The logic of figuring out the right extension string should not be > duplicated, distutils.misc_util seems like a good place for it. Can you test > if this works for you: > https://github.com/rgommers/numpy/tree/sharedlib-ext > > >> Finally the test_basic2 in "tests/test_ctypeslib.py" needs to be >> changed as well because it is no longer correct. Just commenting out >> the 'fix' appears to work. >> >> It works in the case of trying to load multiarray, but the function > under test would still be broken. > > Ralf > > > _______________________________________________ > NumPy-Discussion mailing listNumPy-Discussion at scipy.orghttp://mail.scipy.org/mailman/listinfo/numpy-discussion > > Okay, > So how do I get these changes and apply them to the release candidate? > (This part of the development/test workflow needs some attention.) > > You could just have tested that branch, it should have the same issue. To test on 1.6.x, you can just checkout 1.6.x and cherry-pick the relevant commit. I've done that for you at https://github.com/rgommers/numpy/tree/sharedlib-ext-1.6.x For completeness, this is how you grab it: $ git remote add rgommers https://github.com/rgommers/numpy.git $ git fetch rgommers $ git co rgommers/sharedlib-ext-1.6.x Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Mon Jun 20 15:33:58 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 20 Jun 2011 21:33:58 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: Message-ID: On Mon, Jun 20, 2011 at 12:48 AM, Mark Wiebe wrote: > On Mon, Jun 13, 2011 at 7:58 AM, Ralf Gommers > wrote: > >> Hi, >> >> I am pleased to announce the availability of the first release candidate >> of NumPy 1.6.1. This is a bugfix release, list of fixed bugs: >> #1834 einsum fails for specific shapes >> #1837 einsum throws nan or freezes python for specific array shapes >> #1838 object <-> structured type arrays regression >> #1851 regression for SWIG based code in 1.6.0 >> #1863 Buggy results when operating on array copied with astype() >> >> If no problems are reported, the final release will be in one week. >> Sources and binaries can be found at >> https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ > > > I've committed a fix to #1870, a regression in 1.6.0, in master. I'll leave > you with the judgement call on what to do with it. > > It's a regression, so it would be good to include it. This together with the ctypes.load_library thing would require a second RC, but that's fine. I don't really understand that code though and don't have much time to test it either, so I would appreciate if someone else can review it. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Mon Jun 20 15:35:03 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 20 Jun 2011 21:35:03 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <4DFEF495.6040400@uci.edu> References: <4DFEF495.6040400@uci.edu> Message-ID: On Mon, Jun 20, 2011 at 9:19 AM, Christoph Gohlke wrote: > > > On 6/13/2011 5:58 AM, Ralf Gommers wrote: > > Hi, > > > > I am pleased to announce the availability of the first release candidate > > of NumPy 1.6.1. 
This is a bugfix release, list of fixed bugs: > > #1834 einsum fails for specific shapes > > #1837 einsum throws nan or freezes python for specific array shapes > > #1838 object <-> structured type arrays regression > > #1851 regression for SWIG based code in 1.6.0 > > #1863 Buggy results when operating on array copied with astype() > > > > If no problems are reported, the final release will be in one week. > > Sources and binaries can be found > > at > > https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ > > > > Enjoy, > > Ralf > > > > Hi, > > in case there is going to be another release candidate: could a fix for > ticket #1843 (numpy.rec.array fails with open file on Py3) be included > in numpy 1.6.1? It's not crucial though. > > http://projects.scipy.org/numpy/ticket/1843 > > This should be reviewed and committed to master first. If someone can do that in the next couple of days I'm fine with backporting it. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Mon Jun 20 15:43:36 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 20 Jun 2011 21:43:36 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <4DFF967B.7090407@gmail.com> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> Message-ID: On Mon, Jun 20, 2011 at 8:50 PM, Bruce Southey wrote: > > I copied the files but that just moves the problem. So that patch is > incorrect. > > I get the same errors on Fedora 15 supplied Python3.2 for numpy 1.6.0 and > using git from 'https://github.com/rgommers/numpy.git'. Numpy is getting > Fedora supplied Atlas (1.5.1 does not). > > It appears that there is a misunderstanding of the PEP because 'SO' and > 'SOABI' do exactly what the PEP says on my systems: > It doesn't on OS X. But that's not even the issue. As I explained before, the issue is that get_config_var('SO') is used to determine the extension of system libraries (such as liblapack.so) and python-related ones (such as multiarray.cpython-32m.so). And the current functions don't do mindreading. > > >>> from distutils import sysconfig sysconfig.get_config_var('SO') > '.cpython-32m.so' > >>> sysconfig.get_config_var('SOABI') > 'cpython-32m' > > Consequently, the name, 'multiarray.pyd', created within numpy is invalid. > I removed the line in ctypeslib that was trying this, so I think you are not testing my patch. Ralf > Looking the code, I see this line which makes no sense given that the > second part is true under Linux: > > if (not is_python_ext) and 'SOABI' in > distutils.sysconfig.get_config_vars(): > > So I think the 'get_shared_lib_extension' function is wrong and probably > unneeded. > > > Bruce > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From alex.flint at gmail.com Mon Jun 20 16:15:59 2011 From: alex.flint at gmail.com (Alex Flint) Date: Mon, 20 Jun 2011 16:15:59 -0400 Subject: [Numpy-discussion] fast grayscale conversion Message-ID: At the moment I'm using numpy.dot to convert a WxHx3 RGB image to a grayscale image: src_mono = np.dot(src_rgb.astype(np.float), np.ones(3)/3.); This seems quite slow though (several seconds for a 3 megapixel image) - is there a more specialized routine better suited to this? Cheers, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 20 16:38:45 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 20 Jun 2011 15:38:45 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: Message-ID: On Mon, Jun 20, 2011 at 2:33 PM, Ralf Gommers wrote: > On Mon, Jun 20, 2011 at 12:48 AM, Mark Wiebe wrote: > >> On Mon, Jun 13, 2011 at 7:58 AM, Ralf Gommers < >> ralf.gommers at googlemail.com> wrote: >> >>> Hi, >>> >>> I am pleased to announce the availability of the first release candidate >>> of NumPy 1.6.1. This is a bugfix release, list of fixed bugs: >>> #1834 einsum fails for specific shapes >>> #1837 einsum throws nan or freezes python for specific array shapes >>> #1838 object <-> structured type arrays regression >>> #1851 regression for SWIG based code in 1.6.0 >>> #1863 Buggy results when operating on array copied with astype() >>> >>> If no problems are reported, the final release will be in one week. >>> Sources and binaries can be found at >>> https://sourceforge.net/projects/numpy/files/NumPy/1.6.1rc1/ >> >> >> I've committed a fix to #1870, a regression in 1.6.0, in master. I'll >> leave you with the judgement call on what to do with it. >> >> It's a regression, so it would be good to include it. This together with > the ctypes.load_library thing would require a second RC, but that's fine. > > I don't really understand that code though and don't have much time to test > it either, so I would appreciate if someone else can review it. > I've created a pull request to try and make this as easy as possible: https://github.com/numpy/numpy/pull/92 -Mark > > Ralf > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zachary.pincus at yale.edu Mon Jun 20 16:41:43 2011 From: zachary.pincus at yale.edu (Zachary Pincus) Date: Mon, 20 Jun 2011 16:41:43 -0400 Subject: [Numpy-discussion] fast grayscale conversion In-Reply-To: References: Message-ID: <8FBD6D54-F068-4FAF-8482-E22BD02CA33D@yale.edu> You could try: src_mono = src_rgb.astype(float).sum(axis=-1) / 3. But that speed does seem slow. Here are the relevant timings on my machine (a recent MacBook Pro) for a 3.1-megapixel-size array: In [16]: a = numpy.empty((2048, 1536, 3), dtype=numpy.uint8) In [17]: timeit numpy.dot(a.astype(float), numpy.ones(3)/3.) 10 loops, best of 3: 116 ms per loop In [18]: timeit a.astype(float).sum(axis=-1)/3. 
10 loops, best of 3: 85.3 ms per loop In [19]: timeit a.astype(float) 10 loops, best of 3: 23.3 ms per loop On Jun 20, 2011, at 4:15 PM, Alex Flint wrote: > At the moment I'm using numpy.dot to convert a WxHx3 RGB image to a grayscale image: > > src_mono = np.dot(src_rgb.astype(np.float), np.ones(3)/3.); > > This seems quite slow though (several seconds for a 3 megapixel image) - is there a more specialized routine better suited to this? > > Cheers, > Alex > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From efiring at hawaii.edu Mon Jun 20 17:11:33 2011 From: efiring at hawaii.edu (Eric Firing) Date: Mon, 20 Jun 2011 11:11:33 -1000 Subject: [Numpy-discussion] fast grayscale conversion In-Reply-To: <8FBD6D54-F068-4FAF-8482-E22BD02CA33D@yale.edu> References: <8FBD6D54-F068-4FAF-8482-E22BD02CA33D@yale.edu> Message-ID: <4DFFB785.1090600@hawaii.edu> On 06/20/2011 10:41 AM, Zachary Pincus wrote: > You could try: > src_mono = src_rgb.astype(float).sum(axis=-1) / 3. > > But that speed does seem slow. Here are the relevant timings on my machine (a recent MacBook Pro) for a 3.1-megapixel-size array: > In [16]: a = numpy.empty((2048, 1536, 3), dtype=numpy.uint8) > > In [17]: timeit numpy.dot(a.astype(float), numpy.ones(3)/3.) > 10 loops, best of 3: 116 ms per loop > > In [18]: timeit a.astype(float).sum(axis=-1)/3. > 10 loops, best of 3: 85.3 ms per loop > > In [19]: timeit a.astype(float) > 10 loops, best of 3: 23.3 ms per loop > > On my slower machine (older laptop, core2 duo), you can speed it up more: In [3]: timeit a.astype(float).sum(axis=-1)/3.0 1 loops, best of 3: 235 ms per loop In [5]: timeit b = a.astype(float).sum(axis=-1); b /= 3.0 1 loops, best of 3: 181 ms per loop In [7]: timeit b = a.astype(np.float32).sum(axis=-1); b /= 3.0 10 loops, best of 3: 148 ms per loop If you really want float64, it is still faster to do the first operation with single precision: In [8]: timeit b = a.astype(np.float32).sum(axis=-1).astype(np.float64); b /= 3.0 10 loops, best of 3: 163 ms per loop Eric > > > On Jun 20, 2011, at 4:15 PM, Alex Flint wrote: > >> At the moment I'm using numpy.dot to convert a WxHx3 RGB image to a grayscale image: >> >> src_mono = np.dot(src_rgb.astype(np.float), np.ones(3)/3.); >> >> This seems quite slow though (several seconds for a 3 megapixel image) - is there a more specialized routine better suited to this? >> >> Cheers, >> Alex >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jsseabold at gmail.com Mon Jun 20 17:36:00 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Mon, 20 Jun 2011 17:36:00 -0400 Subject: [Numpy-discussion] Numpy/Scipy Testing Guidelines URL? Message-ID: Are the testing guidelines included in the HTML docs anywhere? If I recall, they used to be, and I couldn't find them with a brief look/google. I'd like to link to them. Maybe the rendered rst page is considered their new home? 
https://github.com/numpy/numpy/blob/master/doc/TESTS.rst.txt Skipper From ralf.gommers at googlemail.com Mon Jun 20 17:41:12 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 20 Jun 2011 23:41:12 +0200 Subject: [Numpy-discussion] Numpy/Scipy Testing Guidelines URL? In-Reply-To: References: Message-ID: On Mon, Jun 20, 2011 at 11:36 PM, Skipper Seabold wrote: > Are the testing guidelines included in the HTML docs anywhere? If I > recall, they used to be, and I couldn't find them with a brief > look/google. I'd like to link to them. Maybe the rendered rst page is > considered their new home? > That is the new home. Should be linked from the old home. A patch to add those to the numpy devguide is welcome I think. Ralf > https://github.com/numpy/numpy/blob/master/doc/TESTS.rst.txt > > Skipper > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsseabold at gmail.com Mon Jun 20 17:44:42 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Mon, 20 Jun 2011 17:44:42 -0400 Subject: [Numpy-discussion] Numpy/Scipy Testing Guidelines URL? In-Reply-To: References: Message-ID: On Mon, Jun 20, 2011 at 5:41 PM, Ralf Gommers wrote: > > > On Mon, Jun 20, 2011 at 11:36 PM, Skipper Seabold > wrote: >> >> Are the testing guidelines included in the HTML docs anywhere? If I >> recall, they used to be, and I couldn't find them with a brief >> look/google. I'd like to link to them. Maybe the rendered rst page is >> considered their new home? > > That is the new home. Should be linked from the old home. A patch to add > those to the numpy devguide is welcome I think. > numpy/doc/source/dev? This is what I was thinking. But I didn't know if there was some other grand design to having a lot of the developer specific information in numpy/doc. Skipper From ralf.gommers at googlemail.com Mon Jun 20 17:53:37 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Mon, 20 Jun 2011 23:53:37 +0200 Subject: [Numpy-discussion] Numpy/Scipy Testing Guidelines URL? In-Reply-To: References: Message-ID: On Mon, Jun 20, 2011 at 11:44 PM, Skipper Seabold wrote: > On Mon, Jun 20, 2011 at 5:41 PM, Ralf Gommers > wrote: > > > > > > On Mon, Jun 20, 2011 at 11:36 PM, Skipper Seabold > > wrote: > >> > >> Are the testing guidelines included in the HTML docs anywhere? If I > >> recall, they used to be, and I couldn't find them with a brief > >> look/google. I'd like to link to them. Maybe the rendered rst page is > >> considered their new home? > > > > That is the new home. Should be linked from the old home. A patch to add > > those to the numpy devguide is welcome I think. > > > > numpy/doc/source/dev? Yep, ends up at http://docs.scipy.org/doc/numpy/dev/ > This is what I was thinking. But I didn't know > if there was some other grand design to having a lot of the developer > specific information in numpy/doc. > > If there is, I haven't seen the blueprint. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex.flint at gmail.com Mon Jun 20 18:05:37 2011 From: alex.flint at gmail.com (Alex Flint) Date: Mon, 20 Jun 2011 18:05:37 -0400 Subject: [Numpy-discussion] fast grayscale conversion In-Reply-To: <4DFFB785.1090600@hawaii.edu> References: <8FBD6D54-F068-4FAF-8482-E22BD02CA33D@yale.edu> <4DFFB785.1090600@hawaii.edu> Message-ID: Thanks, that's helpful. 
I'm now getting comparable times on a different machine, it must be something else slowing down my machine more generally, not just numpy. On Mon, Jun 20, 2011 at 5:11 PM, Eric Firing wrote: > On 06/20/2011 10:41 AM, Zachary Pincus wrote: > > You could try: > > src_mono = src_rgb.astype(float).sum(axis=-1) / 3. > > > > But that speed does seem slow. Here are the relevant timings on my > machine (a recent MacBook Pro) for a 3.1-megapixel-size array: > > In [16]: a = numpy.empty((2048, 1536, 3), dtype=numpy.uint8) > > > > In [17]: timeit numpy.dot(a.astype(float), numpy.ones(3)/3.) > > 10 loops, best of 3: 116 ms per loop > > > > In [18]: timeit a.astype(float).sum(axis=-1)/3. > > 10 loops, best of 3: 85.3 ms per loop > > > > In [19]: timeit a.astype(float) > > 10 loops, best of 3: 23.3 ms per loop > > > > > > On my slower machine (older laptop, core2 duo), you can speed it up more: > > In [3]: timeit a.astype(float).sum(axis=-1)/3.0 > 1 loops, best of 3: 235 ms per loop > > In [5]: timeit b = a.astype(float).sum(axis=-1); b /= 3.0 > 1 loops, best of 3: 181 ms per loop > > In [7]: timeit b = a.astype(np.float32).sum(axis=-1); b /= 3.0 > 10 loops, best of 3: 148 ms per loop > > If you really want float64, it is still faster to do the first operation > with single precision: > > In [8]: timeit b = a.astype(np.float32).sum(axis=-1).astype(np.float64); > b /= 3.0 > 10 loops, best of 3: 163 ms per loop > > Eric > > > > > > > > On Jun 20, 2011, at 4:15 PM, Alex Flint wrote: > > > >> At the moment I'm using numpy.dot to convert a WxHx3 RGB image to a > grayscale image: > >> > >> src_mono = np.dot(src_rgb.astype(np.float), np.ones(3)/3.); > >> > >> This seems quite slow though (several seconds for a 3 megapixel image) - > is there a more specialized routine better suited to this? > >> > >> Cheers, > >> Alex > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 20 19:01:40 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 20 Jun 2011 18:01:40 -0500 Subject: [Numpy-discussion] Object array from list in 1.6.0 (vs. 1.5.1) In-Reply-To: <4DF24917.5060703@jhu.edu> References: <4DF24917.5060703@jhu.edu> Message-ID: This pull request which needs some testing should fix the issue: https://github.com/numpy/numpy/pull/92 -Mark On Fri, Jun 10, 2011 at 11:40 AM, Ken Basye wrote: > Dear folks, > I have some code that stopped working with 1.6.0 and I'm wondering if > there's a better way to replace it than what I came up with. Here's a > condensed version: > > x = [()] # list containing an empty tuple; this isn't the only case, > but it's one that must be handled correctly > y = np.empty(len(x), dtype=object) > y[:] = x[:] > > In 1.5.1 this works, giving y as "array([()], dtype=object)" > > In 1.6.0, it raises a ValueError: > ValueError: output operand requires a reduction, but reduction is not > enabled > > I didn't see anything in the release notes about this; admittedly it's a > very small corner case. 
I also don't understand what the error message > is trying to tell me here. > > Most of the more straightforward ways to construct the desired array > don't work because the interpretation of a nested structure is as a > multi-dimensional array, which is reasonable. > > At this point my workaround is to replace the assignment with a loop: > > for i, v in enumerate(x): > y[i] = v > > but this seems non-Numpyish. Any light on what the error means or a > better way to do this would be most welcome. > > Thanks, > Ken > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Mon Jun 20 19:41:00 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Mon, 20 Jun 2011 16:41:00 -0700 Subject: [Numpy-discussion] fast grayscale conversion In-Reply-To: References: <8FBD6D54-F068-4FAF-8482-E22BD02CA33D@yale.edu> <4DFFB785.1090600@hawaii.edu> Message-ID: <4DFFDA8C.8040408@noaa.gov> Alex Flint wrote: > Thanks, that's helpful. I'm now getting comparable times on a different > machine, it must be something else slowing down my machine more > generally, not just numpy. you also might want to get a bit fancier than simply scaling linearly R,G, and B don't necessarily all contribute equally to our sense of "whiteness" For instance, PIL uses: """ When from a colour image to black and white, the library uses the ITU-R 601-2 luma transform: L = R * 299/1000 + G * 587/1000 + B * 114/1000 """ which would be easy enough to do with numpy. -Chris > > On Mon, Jun 20, 2011 at 5:11 PM, Eric Firing > wrote: > > On 06/20/2011 10:41 AM, Zachary Pincus wrote: > > You could try: > > src_mono = src_rgb.astype(float).sum(axis=-1) / 3. > > > > But that speed does seem slow. Here are the relevant timings on > my machine (a recent MacBook Pro) for a 3.1-megapixel-size array: > > In [16]: a = numpy.empty((2048, 1536, 3), dtype=numpy.uint8) > > > > In [17]: timeit numpy.dot(a.astype(float), numpy.ones(3)/3.) > > 10 loops, best of 3: 116 ms per loop > > > > In [18]: timeit a.astype(float).sum(axis=-1)/3. > > 10 loops, best of 3: 85.3 ms per loop > > > > In [19]: timeit a.astype(float) > > 10 loops, best of 3: 23.3 ms per loop > > > > > > On my slower machine (older laptop, core2 duo), you can speed it up > more: > > In [3]: timeit a.astype(float).sum(axis=-1)/3.0 > 1 loops, best of 3: 235 ms per loop > > In [5]: timeit b = a.astype(float).sum(axis=-1); b /= 3.0 > 1 loops, best of 3: 181 ms per loop > > In [7]: timeit b = a.astype(np.float32).sum(axis=-1); b /= 3.0 > 10 loops, best of 3: 148 ms per loop > > If you really want float64, it is still faster to do the first operation > with single precision: > > In [8]: timeit b = a.astype(np.float32).sum(axis=-1).astype(np.float64); > b /= 3.0 > 10 loops, best of 3: 163 ms per loop > > Eric > > > > > > > > On Jun 20, 2011, at 4:15 PM, Alex Flint wrote: > > > >> At the moment I'm using numpy.dot to convert a WxHx3 RGB image > to a grayscale image: > >> > >> src_mono = np.dot(src_rgb.astype(np.float), np.ones(3)/3.); > >> > >> This seems quite slow though (several seconds for a 3 megapixel > image) - is there a more specialized routine better suited to this? 
> >> > >> Cheers, > >> Alex > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > ------------------------------------------------------------------------ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Mon Jun 20 20:38:28 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 20 Jun 2011 19:38:28 -0500 Subject: [Numpy-discussion] datetime pull request for review and build/test Message-ID: https://github.com/numpy/numpy/pull/93 The summary: * Tighten up date unit vs time unit casting rules, and integrate the NPY_CASTING enum deeper into the datetime conversions * Determine a unit when converting from a string array, similar to when converting from lists of strings * Switch local/utc handling to a timezone= parameter, which also accepts a datetime.tzinfo object. This, for example, enables the use of the pytz library with numpy.datetime64 * Remove the events metadata, make the old API functions raise exceptions, and rename the "frequency" metadata name to "timeunit" * Abstract the flexible dtype mechanism into a function, so that it can be more easily changed without having to add code to many places -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Mon Jun 20 21:06:37 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Tue, 21 Jun 2011 03:06:37 +0200 Subject: [Numpy-discussion] npyio -> gzip 271 Error -3 while decompressing ? In-Reply-To: References: <2E280CE7-0DDB-40B0-A849-61EDAB166874@astro.physik.uni-goettingen.de> Message-ID: <5254DFDB-CBD8-46E5-A2B4-1EF96636C54E@astro.physik.uni-goettingen.de> Moin Denis, On 20 Jun 2011, at 19:04, denis wrote: > a separate question, have you run genfromtxt( "xx.csv.gz" ) lately ? I haven't, and I was not particularly involved with it before this patch, so this would possibly be better addressed to the list. 
> On on .../scikits.learn-0.8/scikits/learn/datasets/data.digits.csv.gz > numpy 1.6.0, py 2.6 mac I get > > X = np.genfromtxt( filename, delimiter="," ) > File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ > python2.6/site-packages/numpy/lib/npyio.py", line 1271, in genfromtxt > first_line = fhd.next() > File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ > python2.6/gzip.py", line 438, in next > line = self.readline() > File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ > python2.6/gzip.py", line 393, in readline > c = self.read(readsize) > File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ > python2.6/gzip.py", line 219, in read > self._read(readsize) > File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ > python2.6/gzip.py", line 271, in _read > uncompress = self.decompress.decompress(buf) > zlib.error: Error -3 while decompressing: invalid distance too far > back > > It would be nice to fix this too, if it hasn't been already. > Btw the file gunzips fine. I could reproduce that error for the gzip'ed csv files in that directory; it can be isolated to the underlying gzip call above - fhd = gzip.open('digits.csv.gz', 'rbU'); fhd.next() produces the same error for these files with all python2.x versions on my Mac, but not with python3.x. Also only with the 'U' mode specified, yet the same mode is parsing other .gz files just fine. I could not really track down what the 'U' flag is doing in gzip.py, but I assume it is specifying some kind of unbuffered read. Also it's a mystery to me what is different in those files that would trigger the error. I even read them in with loadtxt() and wrote them back using constant line width and/or spaces as separators, still producing the same exception. The obvious place to fix this (or work around a bug in python2's gzip.py, whatever), would be changing the open command in genfromtxt fhd = iter(np.lib._datasource.open(fname, 'rbU')) to omit the 'U' at least with python2. Alternatively one could do a test read and catch the exception, to then re-open the file with mode 'rb'... Pierre, if you are reading this, can you comment how important the 'U' is for performance considerations or such? HTH, Derek From bsouthey at gmail.com Mon Jun 20 21:55:09 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 20 Jun 2011 20:55:09 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> Message-ID: On Mon, Jun 20, 2011 at 2:43 PM, Ralf Gommers wrote: > > > On Mon, Jun 20, 2011 at 8:50 PM, Bruce Southey wrote: >> >> I copied the files but that just moves the problem. So that patch is >> incorrect. >> >> I get the same errors on Fedora 15 supplied Python3.2 for numpy 1.6.0 and >> using git from 'https://github.com/rgommers/numpy.git'.? Numpy is getting >> Fedora supplied Atlas (1.5.1 does not). >> >> It appears that there is a misunderstanding of the PEP because 'SO' and >> 'SOABI' do exactly what the PEP says on my systems: > > It doesn't on OS X. But that's not even the issue. As I explained before, > the issue is that get_config_var('SO') is used to determine the extension of > system libraries (such as liblapack.so) and python-related ones (such as > multiarray.cpython-32m.so).? And the current functions don't do mindreading. 
>> >> >>> from distutils import sysconfig sysconfig.get_config_var('SO') >> '.cpython-32m.so' >> >>> sysconfig.get_config_var('SOABI') >> 'cpython-32m' >> >> Consequently, the name, 'multiarray.pyd', created within numpy is invalid. > > I removed the line in ctypeslib that was trying this, so I think you are not > testing my patch. > > Ralf > >> >> Looking the code, I see this line which makes no sense given that the >> second part is true under Linux: >> >> if (not is_python_ext) and 'SOABI' in >> distutils.sysconfig.get_config_vars(): >> >> So I think the 'get_shared_lib_extension' function is wrong and probably >> unneeded. >> >> >> Bruce >> Just to show that this is the new version, I added two print statements in the 'get_shared_lib_extension' function: >>> from numpy.distutils.misc_util import get_shared_lib_extension >>> get_shared_lib_extension(True) first so_ext .cpython-32mu.so returned so_ext .cpython-32mu.so '.cpython-32mu.so' >>> get_shared_lib_extension(False) first so_ext .cpython-32mu.so returned so_ext .so '.so' The reason for the same location is obvious because all the patch does is move the code to get the extension into that function. So the 'get_shared_lib_extension' function returns the extension '.so' to the load_library function. However that name is wrong under Linux as it has to be 'multiarray.cpython-32mu.so' and hence the error in the same location. I did come across this thread 'http://bugs.python.org/issue10262' which indicates why Linux is different by default. So what is the actual name of the multiarray shared library with the Mac? If it is ' 'multiarray.so' then the correct name is "libname + sysconfig.get_config_var('SO')" as I previously indicated. Bruce $ python3 Python 3.2 (r32:88445, Feb 21 2011, 21:11:06) [GCC 4.6.0 20110212 (Red Hat 4.6.0-0.7)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> import numpy as np >>> np.test() Running unit tests for numpy NumPy version 2.0.0.dev-Unknown NumPy is installed in /usr/lib64/python3.2/site-packages/numpy Python version 3.2 (r32:88445, Feb 21 2011, 21:11:06) [GCC 4.6.0 20110212 (Red Hat 4.6.0-0.7)] nose version 1.0.0 first so_ext .cpython-32mu.so returned so_ext .so ...F.......S......F.....E..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K............................................................................................................................................................................................K..................................................................................................K......................K...........................................................................................first so_ext .cpython-32mu.so returned so_ext .so ...............S..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/lib64/python3.2/site-packages/numpy/lib/format.py:575: ResourceWarning: unclosed file <_io.BufferedReader name='/tmp/tmpw661ug'> mode=mode, offset=offset) .........................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/lib64/python3.2/subprocess.py:460: ResourceWarning: unclosed file <_io.BufferedReader name=3> return Popen(*popenargs, **kwargs).wait() /usr/lib64/python3.2/subprocess.py:460: ResourceWarning: unclosed file <_io.BufferedReader name=8> return Popen(*popenargs, **kwargs).wait() 
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................first so_ext .cpython-32mu.so returned so_ext .so E........ ====================================================================== ERROR: test_datetime_divide (test_datetime.TestDateTime) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 762, in test_datetime_divide assert_equal(tda / tdb, 6.0 / 9.0) TypeError: internal error: could not find appropriate datetime inner loop in true_divide ufunc ====================================================================== ERROR: Failure: OSError (/usr/lib64/python3.2/site-packages/numpy/core/multiarray.so: cannot open shared object file: No such file or directory) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/failure.py", line 37, in runTest raise self.exc_class(self.exc_val).with_traceback(self.tb) File "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/loader.py", line 390, in loadTestsFromName addr.filename, addr.module) File "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/importer.py", line 39, in importFromPath return self.importFromDir(dir_path, fqname) File "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/importer.py", line 86, in importFromDir mod = load_module(part_fqname, fh, filename, desc) File "/usr/lib64/python3.2/site-packages/numpy/tests/test_ctypeslib.py", line 9, in cdll = load_library('multiarray', np.core.multiarray.__file__) File "/usr/lib64/python3.2/site-packages/numpy/ctypeslib.py", line 124, in load_library raise exc File "/usr/lib64/python3.2/site-packages/numpy/ctypeslib.py", line 121, in load_library return ctypes.cdll[libpath] File "/usr/lib64/python3.2/ctypes/__init__.py", line 415, in __getitem__ return getattr(self, name) File "/usr/lib64/python3.2/ctypes/__init__.py", line 410, in __getattr__ dll = self._dlltype(name) File "/usr/lib64/python3.2/ctypes/__init__.py", line 340, in __init__ self._handle = _dlopen(self._name, mode) OSError: /usr/lib64/python3.2/site-packages/numpy/core/multiarray.so: cannot open shared object file: No such file or directory ====================================================================== FAIL: Test custom format function for each element in array. 
---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_arrayprint.py", line 86, in test_format_function "[0x0L 0x1L 0x2L]") File "/usr/lib64/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: '[0x0 0x1 0x2]' DESIRED: '[0x0L 0x1L 0x2L]' ====================================================================== FAIL: test_datetime_as_string (test_datetime.TestDateTime) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 1001, in test_datetime_as_string '1959') File "/usr/lib64/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: b'1959' DESIRED: '1959' ---------------------------------------------------------------------- Ran 3215 tests in 45.744s FAILED (KNOWNFAIL=4, SKIP=2, errors=2, failures=2) From jsseabold at gmail.com Mon Jun 20 23:30:04 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Mon, 20 Jun 2011 23:30:04 -0400 Subject: [Numpy-discussion] numpy submodules missing in intersphinx objects.inv? Message-ID: I was just trying to link to the numpy.testing module but can't using intersphinx. Does anyone know why certain submodules aren't included in objects.inv? It looks as though it has something to do either with having a reference at the top of the rst file (so you can :ref: link to it) or having that module autodoc'd (then you can :mod: link to it), but I can't find any actual documentation about this actually being the case. This is the relevant ticket http://projects.scipy.org/numpy/ticket/1432 Skipper From ralf.gommers at googlemail.com Tue Jun 21 02:01:04 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Tue, 21 Jun 2011 08:01:04 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> Message-ID: On Tue, Jun 21, 2011 at 3:55 AM, Bruce Southey wrote: > On Mon, Jun 20, 2011 at 2:43 PM, Ralf Gommers > wrote: > > > > > > On Mon, Jun 20, 2011 at 8:50 PM, Bruce Southey > wrote: > >> > >> I copied the files but that just moves the problem. So that patch is > >> incorrect. > >> > >> I get the same errors on Fedora 15 supplied Python3.2 for numpy 1.6.0 > and > >> using git from 'https://github.com/rgommers/numpy.git'. Numpy is > getting > >> Fedora supplied Atlas (1.5.1 does not). > >> > >> It appears that there is a misunderstanding of the PEP because 'SO' and > >> 'SOABI' do exactly what the PEP says on my systems: > > > > It doesn't on OS X. But that's not even the issue. As I explained before, > > the issue is that get_config_var('SO') is used to determine the extension > of > > system libraries (such as liblapack.so) and python-related ones (such as > > multiarray.cpython-32m.so). And the current functions don't do > mindreading. > >> > >> >>> from distutils import sysconfig sysconfig.get_config_var('SO') > >> '.cpython-32m.so' > >> >>> sysconfig.get_config_var('SOABI') > >> 'cpython-32m' > >> > >> Consequently, the name, 'multiarray.pyd', created within numpy is > invalid. 
> > > > I removed the line in ctypeslib that was trying this, so I think you are > not > > testing my patch. > > > > Ralf > > > >> > >> Looking the code, I see this line which makes no sense given that the > >> second part is true under Linux: > >> > >> if (not is_python_ext) and 'SOABI' in > >> distutils.sysconfig.get_config_vars(): > >> > >> So I think the 'get_shared_lib_extension' function is wrong and probably > >> unneeded. > >> > >> > >> Bruce > >> > > Just to show that this is the new version, I added two print > statements in the 'get_shared_lib_extension' function: > >>> from numpy.distutils.misc_util import get_shared_lib_extension > >>> get_shared_lib_extension(True) > first so_ext .cpython-32mu.so > returned so_ext .cpython-32mu.so > '.cpython-32mu.so' > >>> get_shared_lib_extension(False) > first so_ext .cpython-32mu.so > returned so_ext .so > '.so' > This all looks correct. Before you were saying you were still getting 'multiarray.pyd', now your error says 'multiarray.so'. So now you are testing the right thing. Test test_basic2() in test_ctypeslib was fixed, but I forgot to fix it in two other places. I updated both my branches on github, please try again. > > The reason for the same location is obvious because all the patch does > is move the code to get the extension into that function. So the > 'get_shared_lib_extension' function returns the extension '.so' to the > load_library function. However that name is wrong under Linux as it > has to be 'multiarray.cpython-32mu.so' and hence the error in the same > location. I did come across this thread > 'http://bugs.python.org/issue10262' which indicates why Linux is > different by default. > > So what is the actual name of the multiarray shared library with the Mac? > If it is ' 'multiarray.so' then the correct name is "libname + > sysconfig.get_config_var('SO')" as I previously indicated. > > It is, and yes that's correct. Orthogonal to the actual issue though. Ralf > > Bruce > > > > > $ python3 > Python 3.2 (r32:88445, Feb 21 2011, 21:11:06) > [GCC 4.6.0 20110212 (Red Hat 4.6.0-0.7)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
> >>> import numpy as np > >>> np.test() > Running unit tests for numpy > NumPy version 2.0.0.dev-Unknown > NumPy is installed in /usr/lib64/python3.2/site-packages/numpy > Python version 3.2 (r32:88445, Feb 21 2011, 21:11:06) [GCC 4.6.0 > 20110212 (Red Hat 4.6.0-0.7)] > nose version 1.0.0 > first so_ext .cpython-32mu.so > returned so_ext .so > > ...F.......S......F.....E..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K............................................................................................................................................................................................K..................................................................................................K......................K...........................................................................................first > so_ext .cpython-32mu.so > returned so_ext .so > > ...............S..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/lib64/python3.2/site-packages/numpy/lib/format.py:575: > ResourceWarning: unclosed file <_io.BufferedReader > name='/tmp/tmpw661ug'> > mode=mode, offset=offset) > > .........................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/lib64/python3.2/subprocess.py:460: > ResourceWarning: unclosed file <_io.BufferedReader name=3> > return Popen(*popenargs, **kwargs).wait() > /usr/lib64/python3.2/subprocess.py:460: ResourceWarning: unclosed file > <_io.BufferedReader name=8> > return Popen(*popenargs, **kwargs).wait() > > 
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................first > so_ext .cpython-32mu.so > returned so_ext .so > E........ > ====================================================================== > ERROR: test_datetime_divide (test_datetime.TestDateTime) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_datetime.py", > line 762, in test_datetime_divide > assert_equal(tda / tdb, 6.0 / 9.0) > TypeError: internal error: could not find appropriate datetime inner > loop in true_divide ufunc > > ====================================================================== > ERROR: Failure: OSError > (/usr/lib64/python3.2/site-packages/numpy/core/multiarray.so: cannot > open shared object file: No such file or directory) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/failure.py", > line 37, in runTest > raise self.exc_class(self.exc_val).with_traceback(self.tb) > File > "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/loader.py", > line 390, in loadTestsFromName > addr.filename, addr.module) > File > "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/importer.py", > line 39, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/usr/lib/python3.2/site-packages/nose-1.0.0-py3.2.egg/nose/importer.py", > line 86, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File "/usr/lib64/python3.2/site-packages/numpy/tests/test_ctypeslib.py", > line 9, in > cdll = load_library('multiarray', np.core.multiarray.__file__) > File "/usr/lib64/python3.2/site-packages/numpy/ctypeslib.py", line > 124, in load_library > raise exc > File "/usr/lib64/python3.2/site-packages/numpy/ctypeslib.py", line > 121, in load_library > return ctypes.cdll[libpath] > File "/usr/lib64/python3.2/ctypes/__init__.py", line 415, in __getitem__ > return getattr(self, name) > File "/usr/lib64/python3.2/ctypes/__init__.py", line 410, in __getattr__ > dll = self._dlltype(name) > File "/usr/lib64/python3.2/ctypes/__init__.py", line 340, in __init__ > self._handle = _dlopen(self._name, mode) > OSError: /usr/lib64/python3.2/site-packages/numpy/core/multiarray.so: > cannot open shared object file: No such file or directory > > ====================================================================== > FAIL: Test custom format function for each element in array. 
> ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_arrayprint.py", > line 86, in test_format_function > "[0x0L 0x1L 0x2L]") > File "/usr/lib64/python3.2/site-packages/numpy/testing/utils.py", > line 313, in assert_equal > raise AssertionError(msg) > AssertionError: > Items are not equal: > ACTUAL: '[0x0 0x1 0x2]' > DESIRED: '[0x0L 0x1L 0x2L]' > > ====================================================================== > FAIL: test_datetime_as_string (test_datetime.TestDateTime) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_datetime.py", > line 1001, in test_datetime_as_string > '1959') > File "/usr/lib64/python3.2/site-packages/numpy/testing/utils.py", > line 313, in assert_equal > raise AssertionError(msg) > AssertionError: > Items are not equal: > ACTUAL: b'1959' > DESIRED: '1959' > > ---------------------------------------------------------------------- > Ran 3215 tests in 45.744s > > FAILED (KNOWNFAIL=4, SKIP=2, errors=2, failures=2) > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaeladamkatz at yahoo.com Tue Jun 21 03:05:39 2011 From: michaeladamkatz at yahoo.com (Michael Katz) Date: Tue, 21 Jun 2011 00:05:39 -0700 (PDT) Subject: [Numpy-discussion] faster in1d() for monotonic case? In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> Message-ID: <302267.43619.qm@web161306.mail.bf1.yahoo.com> The following call is a bottleneck for me: ??? np.in1d( large_array.field_of_interest,?values_of_interest ) I'm not sure how in1d() is implemented, but this call seems to be slower than O(n) and faster than O( n**2 ), so perhaps it sorts the values_of_interest and does a binary search for each element of large_array? In any case, in my situation I actually know that field_of_interest increases monotonically across the large_array. So if I were writing this in C, I could do a simple O(n) loop by sorting values_of_interest and then just checking each value of large_array against values_of_interest[ i ] and values_of_interest[ i + 1 ], and any time it matched values_of_interest[ i + 1 ] increment i. Is there some way to achieve that same efficiency in numpy, taking advantage of the monotonic nature of field_of_interest? -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdforbes at gmail.com Tue Jun 21 04:22:54 2011 From: bdforbes at gmail.com (Ben Forbes) Date: Tue, 21 Jun 2011 18:22:54 +1000 Subject: [Numpy-discussion] Controlling endianness of ndarray.tofile() Message-ID: Hi, On my system (Intel Xeon, Windows 7 64-bit), ndarray.tofile() outputs in little-endian. This is a bit inconvenient, since everything else I do is in big-endian. Unfortunately, scipy.io.write_arrray() is deprecated, and I can't find any other routines that write pure raw binary. Are there any other options, or perhaps could tofile() be modified to allow control over endianness? Cheers, Ben -- Benjamin D. 
Forbes School of Physics The University of Melbourne Parkville, VIC 3010, Australia From nadavh at visionsense.com Tue Jun 21 05:33:24 2011 From: nadavh at visionsense.com (Nadav Horesh) Date: Tue, 21 Jun 2011 02:33:24 -0700 Subject: [Numpy-discussion] Re: faster in1d() for monotonic case? In-Reply-To: <302267.43619.qm@web161306.mail.bf1.yahoo.com> Message-ID: <26FC23E7C398A64083C980D16001012D246BAF6BC4@VA3DIAXVS361.RED001.local> Did you try searchsorted? Nadav ________________________________ From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Michael Katz Sent: Tuesday, June 21, 2011 10:06 To: Discussion of Numerical Python Subject: [Numpy-discussion] faster in1d() for monotonic case? The following call is a bottleneck for me: np.in1d( large_array.field_of_interest, values_of_interest ) I'm not sure how in1d() is implemented, but this call seems to be slower than O(n) and faster than O( n**2 ), so perhaps it sorts the values_of_interest and does a binary search for each element of large_array? In any case, in my situation I actually know that field_of_interest increases monotonically across the large_array. So if I were writing this in C, I could do a simple O(n) loop by sorting values_of_interest and then just checking each value of large_array against values_of_interest[ i ] and values_of_interest[ i + 1 ], and any time it matched values_of_interest[ i + 1 ] increment i. Is there some way to achieve that same efficiency in numpy, taking advantage of the monotonic nature of field_of_interest? -------------- next part -------------- An HTML attachment was scrubbed... URL: From gruben at bigpond.net.au Tue Jun 21 06:37:10 2011 From: gruben at bigpond.net.au (gary ruben) Date: Tue, 21 Jun 2011 20:37:10 +1000 Subject: [Numpy-discussion] Controlling endianness of ndarray.tofile() In-Reply-To: References: Message-ID: Hi Ben, based on this example I suspect the way to do it is with numpy.byteswap() and numpy.tofile() From we can do >>> A = np.array([1, 256, 8755], dtype=np.int16) >>> map(hex, A) ['0x1', '0x100', '0x2233'] >>> A.tofile('a_little.bin') >>> A.byteswap(True) array([ 256, 1, 13090], dtype=int16) >>> map(hex, A) ['0x100', '0x1', '0x3322'] >>> A.tofile('a_big.bin') Gary On Tue, Jun 21, 2011 at 6:22 PM, Ben Forbes wrote: > Hi, > > On my system (Intel Xeon, Windows 7 64-bit), ndarray.tofile() outputs > in little-endian. This is a bit inconvenient, since everything else I > do is in big-endian. Unfortunately, scipy.io.write_arrray() is > deprecated, and I can't find any other routines that write pure raw > binary. Are there any other options, or perhaps could tofile() be > modified to allow control over endianness? > > Cheers, > Ben > > -- > Benjamin D. Forbes > School of Physics > The University of Melbourne > Parkville, VIC 3010, Australia > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From bdforbes at gmail.com Tue Jun 21 07:15:55 2011 From: bdforbes at gmail.com (Ben Forbes) Date: Tue, 21 Jun 2011 21:15:55 +1000 Subject: [Numpy-discussion] Controlling endianness of ndarray.tofile() In-Reply-To: References: Message-ID: Thanks Gary, that works. Out of interest I timed it: http://pastebin.com/HA4Qn9Ge On average the swapping incurred a 0.04 second penalty (compared with 1.5 second total run time) for a 4096x4096 array of 64-bit reals. So there is no real penalty.
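For reference, the same big-endian file can be produced without modifying the array in place, by converting through an explicit big-endian dtype before writing. A minimal sketch (the file name, shape and float64 dtype here are placeholders for illustration):

import numpy as np

a = np.arange(16, dtype=np.float64).reshape(4, 4)        # data in native byte order
a.astype('>f8').tofile('a_big.bin')                      # write a big-endian copy; 'a' itself is untouched
b = np.fromfile('a_big.bin', dtype='>f8').reshape(4, 4)  # read back, declaring the byte order explicitly
assert (a == b).all()

The astype() call makes a temporary byte-swapped copy, so the cost should be comparable to byteswap(True) plus the write, with the advantage that the original array keeps its native byte order.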
Cheers, Ben On Tue, Jun 21, 2011 at 8:37 PM, gary ruben wrote: > Hi Ben, > based on this example > > I suspect the way to do it is with numpy.byteswap() and numpy.tofile() > >From > we can do > >>>> A = np.array([1, 256, 8755], dtype=np.int16) >>>> map(hex, A) > ['0x1', '0x100', '0x2233'] >>>> A.tofile('a_little.bin') >>>> A.byteswap(True) > array([ ?256, ? ? 1, 13090], dtype=int16) >>>> map(hex, A) > ['0x100', '0x1', '0x3322'] >>>> A.tofile('a_big.bin') > > Gary > > On Tue, Jun 21, 2011 at 6:22 PM, Ben Forbes wrote: >> Hi, >> >> On my system (Intel Xeon, Windows 7 64-bit), ndarray.tofile() outputs >> in little-endian. This is a bit inconvenient, since everything else I >> do is in big-endian. Unfortunately, scipy.io.write_arrray() is >> deprecated, and I can't find any other routines that write pure raw >> binary. Are there any other options, or perhaps could tofile() be >> modified to allow control over endianness? >> >> Cheers, >> Ben >> >> -- >> Benjamin D. Forbes >> School of Physics >> The University of Melbourne >> Parkville, VIC 3010, Australia >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Benjamin D. Forbes School of Physics The University of Melbourne Parkville, VIC 3010, Australia From bsouthey at gmail.com Tue Jun 21 10:38:15 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Jun 2011 09:38:15 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> Message-ID: <4E00ACD7.5000504@gmail.com> On 06/21/2011 01:01 AM, Ralf Gommers wrote: > > > On Tue, Jun 21, 2011 at 3:55 AM, Bruce Southey > wrote: > > On Mon, Jun 20, 2011 at 2:43 PM, Ralf Gommers > > > wrote: > > > > > > On Mon, Jun 20, 2011 at 8:50 PM, Bruce Southey > > wrote: > >> > >> I copied the files but that just moves the problem. So that > patch is > >> incorrect. > >> > >> I get the same errors on Fedora 15 supplied Python3.2 for numpy > 1.6.0 and > >> using git from 'https://github.com/rgommers/numpy.git'. Numpy > is getting > >> Fedora supplied Atlas (1.5.1 does not). > >> > >> It appears that there is a misunderstanding of the PEP because > 'SO' and > >> 'SOABI' do exactly what the PEP says on my systems: > > > > It doesn't on OS X. But that's not even the issue. As I > explained before, > > the issue is that get_config_var('SO') is used to determine the > extension of > > system libraries (such as liblapack.so) and python-related ones > (such as > > multiarray.cpython-32m.so ). > And the current functions don't do mindreading. > >> > >> >>> from distutils import sysconfig sysconfig.get_config_var('SO') > >> '.cpython-32m.so' > >> >>> sysconfig.get_config_var('SOABI') > >> 'cpython-32m' > >> > >> Consequently, the name, 'multiarray.pyd', created within numpy > is invalid. > > > > I removed the line in ctypeslib that was trying this, so I think > you are not > > testing my patch. 
> > > > Ralf > > > >> > >> Looking the code, I see this line which makes no sense given > that the > >> second part is true under Linux: > >> > >> if (not is_python_ext) and 'SOABI' in > >> distutils.sysconfig.get_config_vars(): > >> > >> So I think the 'get_shared_lib_extension' function is wrong and > probably > >> unneeded. > >> > >> > >> Bruce > >> > > Just to show that this is the new version, I added two print > statements in the 'get_shared_lib_extension' function: > >>> from numpy.distutils.misc_util import get_shared_lib_extension > >>> get_shared_lib_extension(True) > first so_ext .cpython-32mu.so > returned so_ext .cpython-32mu.so > '.cpython-32mu.so' > >>> get_shared_lib_extension(False) > first so_ext .cpython-32mu.so > returned so_ext .so > '.so' > > > This all looks correct. Before you were saying you were still getting > 'multiarray.pyd', now your error says 'multiarray.so'. So now you are > testing the right thing. Test test_basic2() in test_ctypeslib was > fixed, but I forgot to fix it in two other places. I updated both my > branches on github, please try again. > > > The reason for the same location is obvious because all the patch does > is move the code to get the extension into that function. So the > 'get_shared_lib_extension' function returns the extension '.so' to the > load_library function. However that name is wrong under Linux as it > has to be 'multiarray.cpython-32mu.so > ' and hence the error in the same > location. I did come across this thread > 'http://bugs.python.org/issue10262' which indicates why Linux is > different by default. > > So what is the actual name of the multiarray shared library with > the Mac? > If it is ' 'multiarray.so' then the correct name is "libname + > sysconfig.get_config_var('SO')" as I previously indicated. > > It is, and yes that's correct. Orthogonal to the actual issue though. > > Ralf > > > While the test now pass, you have now changed an API for load_library. This is not something that is meant to occur in a bug-fix release as well as the new argument is undocumented. But I do not understand the need for this extra complexity when "libname + sysconfig.get_config_var('SO')" works on Linux, Windows and Mac. Bruce $ git clone git://github.com/rgommers/numpy.git numpy $ cd numpy $ git checkout sharedlib-ext Switched to branch 'sharedlib-ext' $ git branch -a master * sharedlib-ext remotes/origin/1.5.x remotes/origin/HEAD -> origin/master remotes/origin/compilation-issues-doc remotes/origin/doc-noinstall remotes/origin/maintenance/1.4.x remotes/origin/maintenance/1.5.x remotes/origin/maintenance/1.6.x remotes/origin/master remotes/origin/sharedlib-ext remotes/origin/sharedlib-ext-1.6.x remotes/origin/swigopts remotes/origin/ticket-1218-array2string remotes/origin/ticket-1689-fromstring remotes/origin/ticket-99 remotes/origin/warn-noclean-build [built and installed numpy] $ python3 Python 3.2 (r32:88445, Feb 21 2011, 21:11:06) [GCC 4.6.0 20110212 (Red Hat 4.6.0-0.7)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> import numpy as np >>> np.__version__ '2.0.0.dev-Unknown' >>> np.test() Running unit tests for numpy NumPy version 2.0.0.dev-Unknown NumPy is installed in /usr/lib64/python3.2/site-packages/numpy Python version 3.2 (r32:88445, Feb 21 2011, 21:11:06) [GCC 4.6.0 20110212 (Red Hat 4.6.0-0.7)] nose version 1.0.0 ...F..............F.....E..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K............................................................................................................................................................................................K..................................................................................................K......................K..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/lib64/python3.2/site-packages/numpy/lib/format.py:575: ResourceWarning: unclosed file <_io.BufferedReader name='/tmp/tmphws7d4'> mode=mode, offset=offset) .........................................................................................................................................................................................................................................................................................................................................................................................................................................................................../usr/lib64/python3.2/subprocess.py:460: ResourceWarning: unclosed file <_io.BufferedReader name=3> return Popen(*popenargs, **kwargs).wait() /usr/lib64/python3.2/subprocess.py:460: ResourceWarning: unclosed file <_io.BufferedReader name=8> return Popen(*popenargs, **kwargs).wait() 
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ ====================================================================== ERROR: test_datetime_divide (test_datetime.TestDateTime) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 762, in test_datetime_divide assert_equal(tda / tdb, 6.0 / 9.0) TypeError: internal error: could not find appropriate datetime inner loop in true_divide ufunc ====================================================================== FAIL: Test custom format function for each element in array. ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_arrayprint.py", line 86, in test_format_function "[0x0L 0x1L 0x2L]") File "/usr/lib64/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: '[0x0 0x1 0x2]' DESIRED: '[0x0L 0x1L 0x2L]' ====================================================================== FAIL: test_datetime_as_string (test_datetime.TestDateTime) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib64/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 1001, in test_datetime_as_string '1959') File "/usr/lib64/python3.2/site-packages/numpy/testing/utils.py", line 313, in assert_equal raise AssertionError(msg) AssertionError: Items are not equal: ACTUAL: b'1959' DESIRED: '1959' ---------------------------------------------------------------------- Ran 3578 tests in 36.278s FAILED (KNOWNFAIL=4, errors=1, failures=2) >>> >>> from numpy.distutils.misc_util import get_shared_lib_extension >>> get_shared_lib_extension(True) '.cpython-32mu.so' >>> get_shared_lib_extension(False) '.so' >>> load_library('multiarray', '/usr/lib64/python3.2/site-packages/numpy/core/', True) >>> load_library('multiarray', '/usr/lib64/python3.2/site-packages/numpy/core/') Traceback (most recent call last): File "", line 1, in File "/usr/lib64/python3.2/site-packages/numpy/ctypeslib.py", line 124, in load_library raise exc File "/usr/lib64/python3.2/site-packages/numpy/ctypeslib.py", line 121, in load_library return ctypes.cdll[libpath] File "/usr/lib64/python3.2/ctypes/__init__.py", line 415, in __getitem__ return getattr(self, name) File "/usr/lib64/python3.2/ctypes/__init__.py", line 410, in __getattr__ dll = self._dlltype(name) File "/usr/lib64/python3.2/ctypes/__init__.py", line 340, in __init__ self._handle = 
_dlopen(self._name, mode) OSError: /usr/lib64/python3.2/site-packages/numpy/core/multiarray.so: cannot open shared object file: No such file or directory -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaeladamkatz at yahoo.com Tue Jun 21 11:15:50 2011 From: michaeladamkatz at yahoo.com (Michael Katz) Date: Tue, 21 Jun 2011 08:15:50 -0700 (PDT) Subject: [Numpy-discussion] Re: faster in1d() for monotonic case? In-Reply-To: <26FC23E7C398A64083C980D16001012D246BAF6BC4@VA3DIAXVS361.RED001.local> References: <26FC23E7C398A64083C980D16001012D246BAF6BC4@VA3DIAXVS361.RED001.local> Message-ID: <911482.90251.qm@web161310.mail.bf1.yahoo.com> I'm not quite sure how to use searchsorted to get the output I need (e.g., the length of the output needs to be as long as large_array). But in any case it says it uses binary search, so it would seem to be an O( n * log( n ) ) solution, whereas I'm hoping for an O( n ) solution. ________________________________ From: Nadav Horesh To: Discussion of Numerical Python Sent: Tue, June 21, 2011 2:33:24 AM Subject: [Numpy-discussion] Re: faster in1d() for monotonic case? Did you try searchsorted? Nadav ________________________________ From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Michael Katz Sent: Tuesday, June 21, 2011 10:06 To: Discussion of Numerical Python Subject: [Numpy-discussion] faster in1d() for monotonic case? The following call is a bottleneck for me: np.in1d( large_array.field_of_interest, values_of_interest ) I'm not sure how in1d() is implemented, but this call seems to be slower than O(n) and faster than O( n**2 ), so perhaps it sorts the values_of_interest and does a binary search for each element of large_array? In any case, in my situation I actually know that field_of_interest increases monotonically across the large_array. So if I were writing this in C, I could do a simple O(n) loop by sorting values_of_interest and then just checking each value of large_array against values_of_interest[ i ] and values_of_interest[ i + 1 ], and any time it matched values_of_interest[ i + 1 ] increment i. Is there some way to achieve that same efficiency in numpy, taking advantage of the monotonic nature of field_of_interest? -------------- next part -------------- An HTML attachment was scrubbed... URL: From zachary.pincus at yale.edu Tue Jun 21 12:46:03 2011 From: zachary.pincus at yale.edu (Zachary Pincus) Date: Tue, 21 Jun 2011 12:46:03 -0400 Subject: [Numpy-discussion] poor performance of sum with sub-machine-word integer types Message-ID: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> Hello all, As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndararray.sum(): summing along certain axes is much slower with sum() than versus doing it explicitly, but only with integer dtypes and when the size of the dtype is less than the machine word. I checked in 32-bit and 64-bit modes and in both cases only once the dtype got as large as that did the speed difference go away. See below... Is this something to do with numpy or something inexorable about machine / memory architecture?
Zach Timings -- 64-bit mode: ---------------------- In [2]: i = numpy.ones((1024,1024,4), numpy.int8) In [3]: timeit i.sum(axis=-1) 10 loops, best of 3: 131 ms per loop In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 2.57 ms per loop In [5]: i = numpy.ones((1024,1024,4), numpy.int16) In [6]: timeit i.sum(axis=-1) 10 loops, best of 3: 131 ms per loop In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 4.75 ms per loop In [8]: i = numpy.ones((1024,1024,4), numpy.int32) In [9]: timeit i.sum(axis=-1) 10 loops, best of 3: 131 ms per loop In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 6.37 ms per loop In [11]: i = numpy.ones((1024,1024,4), numpy.int64) In [12]: timeit i.sum(axis=-1) 100 loops, best of 3: 16.6 ms per loop In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 15.1 ms per loop Timings -- 32-bit mode: ---------------------- In [2]: i = numpy.ones((1024,1024,4), numpy.int8) In [3]: timeit i.sum(axis=-1) 10 loops, best of 3: 138 ms per loop In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 3.68 ms per loop In [5]: i = numpy.ones((1024,1024,4), numpy.int16) In [6]: timeit i.sum(axis=-1) 10 loops, best of 3: 140 ms per loop In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 4.17 ms per loop In [8]: i = numpy.ones((1024,1024,4), numpy.int32) In [9]: timeit i.sum(axis=-1) 10 loops, best of 3: 22.4 ms per loop In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 12.2 ms per loop In [11]: i = numpy.ones((1024,1024,4), numpy.int64) In [12]: timeit i.sum(axis=-1) 10 loops, best of 3: 29.2 ms per loop In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 10 loops, best of 3: 23.8 ms per loop From charlesr.harris at gmail.com Tue Jun 21 13:16:19 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jun 2011 11:16:19 -0600 Subject: [Numpy-discussion] poor performance of sum with sub-machine-word integer types In-Reply-To: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> References: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> Message-ID: On Tue, Jun 21, 2011 at 10:46 AM, Zachary Pincus wrote: > Hello all, > > As a result of the "fast greyscale conversion" thread, I noticed an anomaly > with numpy.ndararray.sum(): summing along certain axes is much slower with > sum() than versus doing it explicitly, but only with integer dtypes and when > the size of the dtype is less than the machine word. I checked in 32-bit and > 64-bit modes and in both cases only once the dtype got as large as that did > the speed difference go away. See below... > > Is this something to do with numpy or something inexorable about machine / > memory architecture? > > It's because of the type conversion sum uses by default for greater precision. In [8]: timeit i.sum(axis=-1) 10 loops, best of 3: 140 ms per loop In [9]: timeit i.sum(axis=-1, dtype=int8) 100 loops, best of 3: 16.2 ms per loop If you have 1.6, einsum is faster but also conserves type: In [10]: timeit einsum('ijk->ij', i) 100 loops, best of 3: 5.95 ms per loop We could probably make better loops for summing within kinds, i.e., accumulate in higher precision, then cast to specified precision. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
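To make the dtype behaviour concrete, a minimal sketch (the shape matches the timings in this thread; with the small accumulator dtype, guarding against overflow is left to the caller):

import numpy as np

i = np.ones((1024, 1024, 4), np.int8)

s_default = i.sum(axis=-1)                # accumulates in the default integer, hence the upcast and the slow loop
s_small = i.sum(axis=-1, dtype=np.int8)   # keeps int8 throughout; much faster in the timing quoted above
s_manual = i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]  # explicit slice sum, also stays int8

assert s_small.dtype == np.int8 and s_manual.dtype == np.int8
# s_default.dtype is the platform default integer (int32 or int64)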
URL: From kwgoodman at gmail.com Tue Jun 21 13:17:38 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Tue, 21 Jun 2011 10:17:38 -0700 Subject: [Numpy-discussion] poor performance of sum with sub-machine-word integer types In-Reply-To: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> References: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> Message-ID: On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus wrote: > Hello all, > > As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndararray.sum(): summing along certain axes is much slower with sum() than versus doing it explicitly, but only with integer dtypes and when the size of the dtype is less than the machine word. I checked in 32-bit and 64-bit modes and in both cases only once the dtype got as large as that did the speed difference go away. See below... > > Is this something to do with numpy or something inexorable about machine / memory architecture? > > Zach > > Timings -- 64-bit mode: > ---------------------- > In [2]: i = numpy.ones((1024,1024,4), numpy.int8) > In [3]: timeit i.sum(axis=-1) > 10 loops, best of 3: 131 ms per loop > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 2.57 ms per loop > > In [5]: i = numpy.ones((1024,1024,4), numpy.int16) > In [6]: timeit i.sum(axis=-1) > 10 loops, best of 3: 131 ms per loop > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 4.75 ms per loop > > In [8]: i = numpy.ones((1024,1024,4), numpy.int32) > In [9]: timeit i.sum(axis=-1) > 10 loops, best of 3: 131 ms per loop > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 6.37 ms per loop > > In [11]: i = numpy.ones((1024,1024,4), numpy.int64) > In [12]: timeit i.sum(axis=-1) > 100 loops, best of 3: 16.6 ms per loop > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 15.1 ms per loop > > > > Timings -- 32-bit mode: > ---------------------- > In [2]: i = numpy.ones((1024,1024,4), numpy.int8) > In [3]: timeit i.sum(axis=-1) > 10 loops, best of 3: 138 ms per loop > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 3.68 ms per loop > > In [5]: i = numpy.ones((1024,1024,4), numpy.int16) > In [6]: timeit i.sum(axis=-1) > 10 loops, best of 3: 140 ms per loop > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 4.17 ms per loop > > In [8]: i = numpy.ones((1024,1024,4), numpy.int32) > In [9]: timeit i.sum(axis=-1) > 10 loops, best of 3: 22.4 ms per loop > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 12.2 ms per loop > > In [11]: i = numpy.ones((1024,1024,4), numpy.int64) > In [12]: timeit i.sum(axis=-1) > 10 loops, best of 3: 29.2 ms per loop > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 10 loops, best of 3: 23.8 ms per loop One difference is that i.sum() changes the output dtype of int input when the int dtype is less than the default int dtype: >> i.dtype dtype('int32') >> i.sum(axis=-1).dtype dtype('int64') # <-- dtype changed >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype dtype('int32') Here are my timings >> i = numpy.ones((1024,1024,4), numpy.int32) >> timeit i.sum(axis=-1) 1 loops, best of 3: 278 ms per loop >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] 100 loops, best of 3: 12.1 ms per loop >> import bottleneck as bn >> timeit bn.func.nansum_3d_int32_axis2(i) 100 loops, best of 3: 8.27 ms per loop Does making an extra copy of the input explain all of the speed difference (is this what np.sum does internally?): >> 
timeit i.astype(numpy.int64) 10 loops, best of 3: 29.2 ms per loop No. Initializing the output also adds some time: >> timeit np.empty((1024,1024,4), dtype=np.int32) 100000 loops, best of 3: 2.67 us per loop >> timeit np.empty((1024,1024,4), dtype=np.int64) 100000 loops, best of 3: 12.8 us per loop Switching back and forth between the input and output array takes more "memory" time too with int64 arrays compared to int32. From charlesr.harris at gmail.com Tue Jun 21 13:25:40 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jun 2011 11:25:40 -0600 Subject: [Numpy-discussion] poor performance of sum with sub-machine-word integer types In-Reply-To: References: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> Message-ID: On Tue, Jun 21, 2011 at 11:17 AM, Keith Goodman wrote: > On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus > wrote: > > Hello all, > > > > As a result of the "fast greyscale conversion" thread, I noticed an > anomaly with numpy.ndararray.sum(): summing along certain axes is much > slower with sum() than versus doing it explicitly, but only with integer > dtypes and when the size of the dtype is less than the machine word. I > checked in 32-bit and 64-bit modes and in both cases only once the dtype got > as large as that did the speed difference go away. See below... > > > > Is this something to do with numpy or something inexorable about machine > / memory architecture? > > > > Zach > > > > Timings -- 64-bit mode: > > ---------------------- > > In [2]: i = numpy.ones((1024,1024,4), numpy.int8) > > In [3]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 131 ms per loop > > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 2.57 ms per loop > > > > In [5]: i = numpy.ones((1024,1024,4), numpy.int16) > > In [6]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 131 ms per loop > > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 4.75 ms per loop > > > > In [8]: i = numpy.ones((1024,1024,4), numpy.int32) > > In [9]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 131 ms per loop > > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 6.37 ms per loop > > > > In [11]: i = numpy.ones((1024,1024,4), numpy.int64) > > In [12]: timeit i.sum(axis=-1) > > 100 loops, best of 3: 16.6 ms per loop > > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 15.1 ms per loop > > > > > > > > Timings -- 32-bit mode: > > ---------------------- > > In [2]: i = numpy.ones((1024,1024,4), numpy.int8) > > In [3]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 138 ms per loop > > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 3.68 ms per loop > > > > In [5]: i = numpy.ones((1024,1024,4), numpy.int16) > > In [6]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 140 ms per loop > > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 4.17 ms per loop > > > > In [8]: i = numpy.ones((1024,1024,4), numpy.int32) > > In [9]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 22.4 ms per loop > > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 100 loops, best of 3: 12.2 ms per loop > > > > In [11]: i = numpy.ones((1024,1024,4), numpy.int64) > > In [12]: timeit i.sum(axis=-1) > > 10 loops, best of 3: 29.2 ms per loop > > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > > 10 loops, best of 3: 23.8 ms per loop > > One difference is that i.sum() changes the output dtype of int input > when the int dtype is less than the default int dtype: > 
> >> i.dtype > dtype('int32') > >> i.sum(axis=-1).dtype > dtype('int64') # <-- dtype changed > >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype > dtype('int32') > > Here are my timings > > >> i = numpy.ones((1024,1024,4), numpy.int32) > >> timeit i.sum(axis=-1) > 1 loops, best of 3: 278 ms per loop > >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3] > 100 loops, best of 3: 12.1 ms per loop > >> import bottleneck as bn > >> timeit bn.func.nansum_3d_int32_axis2(i) > 100 loops, best of 3: 8.27 ms per loop > > Does making an extra copy of the input explain all of the speed > difference (is this what np.sum does internally?): > > >> timeit i.astype(numpy.int64) > 10 loops, best of 3: 29.2 ms per loop > > No. > > I think you can see the overhead here: In [14]: timeit einsum('ijk->ij', i, dtype=int32) 100 loops, best of 3: 17.6 ms per loop In [15]: timeit einsum('ijk->ij', i, dtype=int64) 100 loops, best of 3: 18 ms per loop In [16]: timeit einsum('ijk->ij', i, dtype=int16) 100 loops, best of 3: 18.3 ms per loop In [17]: timeit einsum('ijk->ij', i, dtype=int8) 100 loops, best of 3: 5.87 ms per loop > Initializing the output also adds some time: > > >> timeit np.empty((1024,1024,4), dtype=np.int32) > 100000 loops, best of 3: 2.67 us per loop > >> timeit np.empty((1024,1024,4), dtype=np.int64) > 100000 loops, best of 3: 12.8 us per loop > > Switching back and forth between the input and output array takes more > "memory" time too with int64 arrays compared to int32. > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jun 21 13:36:17 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jun 2011 11:36:17 -0600 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe wrote: > NumPy has a mechanism built in to allow subclasses to adjust or override > aspects of the ufunc behavior. While this goal is important, this mechanism > only allows for very limited customization, making for instance the masked > arrays unable to work with the native ufuncs in a full and proper way. I > would like to deprecate the current mechanism, in particular > __array_prepare__ and __array_wrap__, and introduce a new method I will > describe below. If you've ever used these mechanisms, please review this > design to see if it meets your needs. > > > The current approach is at a dead end, so something better needs to be done. > Any class type which would like to override its behavior in ufuncs would > define a method called _numpy_ufunc_, and optionally an attribute > __array_priority__ as can already be done. The class which wins the priority > battle gets its _numpy_ufunc_ function called as follows: > > return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) > > > To support this overloading, the ufunc would get a new support method, > result_type, and there would be a new global function, broadcast_empty_like. > > The function ufunc.empty_like behaves like the global np.result_type, but > produces the output type or a tuple of output types specific to the ufunc, > which may follow a different convention than regular arithmetic type > promotion. This allows for a class to create an output array of the correct > type to pass to the ufunc if it needs to be different than the default. 
> > The function broadcast_empty_like is just like empty_like, but takes a list > or tuple of arrays which are to be broadcast together for producing the > output, instead of just one. > > How does the ufunc get called so it doesn't get caught in an endless loop? I like the proposed method if it can also be used for classes that don't subclass ndarray. Masked array, for instance, should probably not subclass ndarray. > Thanks, > Mark > > > A simple class which overrides the ufuncs might look as follows: > > def sin(ufunc, *args, **kwargs): > # Convert degrees to radians > args[0] = np.deg2rad(args[0]) > # Return a regular array, since the result is not in degrees > return ufunc(*args, **kwargs) > > class MyDegreesClass: > """Array-like object with a degrees unit""" > > def __init__(arr): > self.arr = arr > > def _numpy_ufunc_(ufunc, *args, **kwargs): > override = globals().get(ufunc.name) > if override: > return override(ufunc, *args, **kwargs) > else: > raise TypeError, 'ufunc %s incompatible with MyDegreesClass' % > ufunc.name > > > > A more complex example will be something like this: > > def my_general_ufunc(ufunc, *args, **kwargs): > # Extract the 'out' argument. This only supports ufuncs with > # one output, currently. > out = kwargs.get('out') > if len(args) > ufunc.nin: > if out is None: > out = args[ufunc.nin] > else: > raise ValueError, "'out' given as both a position and keyword > argument" > > # Just want the inputs from here on > args = args[:ufunc.nin] > > # Strip out MyArrayClass, but allow operations with regular ndarrays > raw_in = [] > for a in args: > if isinstance(a, MyArrayClass): > raw_in.append(a.arr) > else: > raw_in.append(a) > > # Allocate the output array > if not out is None: > if isinstance(out, MyArrayClass): > raise TypeError, "'out' must have type MyArrayClass" > else: > # Create the output array, obeying the 'order' parameter, > # but disallowing subclasses > out = np.broadcast_empty_like([args, > order=kwargs.get('order'), > dtype=ufunc.result_type(args), > subok=False) > > # Override the output argument > kwargs['out'] = out.arr > > # Call the ufunc > ufunc(*args, **kwargs) > > # Return the output > return out > > class MyArrayClass: > def __init__(arr): > self.arr = arr > > def _numpy_ufunc_(ufunc, *args, **kwargs): > override = globals().get(ufunc.name) > if override: > return override(ufunc, *args, **kwargs) > else: > return my_general_ufunc(ufunc, *args, **kwargs) > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From zachary.pincus at yale.edu Tue Jun 21 13:38:12 2011 From: zachary.pincus at yale.edu (Zachary Pincus) Date: Tue, 21 Jun 2011 13:38:12 -0400 Subject: [Numpy-discussion] poor performance of sum with sub-machine-word integer types In-Reply-To: References: <1CC79FFB-1EB6-4F55-AB02-98F576EC0032@yale.edu> Message-ID: <42A45493-4CA9-4943-ACBB-8905521B1AE7@yale.edu> On Jun 21, 2011, at 1:16 PM, Charles R Harris wrote: > It's because of the type conversion sum uses by default for greater precision. Aah, makes sense. Thanks for the detailed explanations and timings! From ndbecker2 at gmail.com Tue Jun 21 13:49:26 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Tue, 21 Jun 2011 13:49:26 -0400 Subject: [Numpy-discussion] fast numpy i/o Message-ID: I'm wondering what are good choices for fast numpy array serialization? mmap: fast, but I guess not self-describing? hdf5: ? pickle: self-describing, but maybe not fast? others? 
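For concreteness, the native .npy/.npz route looks roughly like this; a minimal sketch with placeholder file names:

import numpy as np

a = np.random.rand(1000, 1000)

np.save('a.npy', a)                   # fast, self-describing binary format
b = np.load('a.npy')                  # ordinary load
c = np.load('a.npy', mmap_mode='r')   # memory-mapped, avoids reading the whole file up front

np.savez('bundle.npz', a=a, twice=2 * a)    # several named arrays in one file
d = np.load('bundle.npz')['twice']

It is not a format Matlab reads natively, though, so for Matlab interoperability HDF5 or raw tofile() output is the more likely choice.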
From ndbecker2 at gmail.com Tue Jun 21 13:58:25 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Tue, 21 Jun 2011 13:58:25 -0400 Subject: [Numpy-discussion] fast numpy i/o References: Message-ID: Neal Becker wrote: > I'm wondering what are good choices for fast numpy array serialization? > > mmap: fast, but I guess not self-describing? > hdf5: ? > pickle: self-describing, but maybe not fast? > others? I think, in addition, that hdf5 is the only one that easily interoperates with matlab? speaking of hdf5, I see: pyhdf5io 0.7 - Python module containing high-level hdf5 load and save functions. h5py 2.0.0 - Read and write HDF5 files from Python Any thoughts on the relative merits of these? From robert.kern at gmail.com Tue Jun 21 14:00:46 2011 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 21 Jun 2011 13:00:46 -0500 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 12:49, Neal Becker wrote: > I'm wondering what are good choices for fast numpy array serialization? > > mmap: fast, but I guess not self-describing? > hdf5: ? > pickle: self-describing, but maybe not fast? > others? NPY: http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html (Note the mmap_mode argument) https://raw.github.com/numpy/numpy/master/doc/neps/npy-format.txt -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From Chris.Barker at noaa.gov Tue Jun 21 14:01:25 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Tue, 21 Jun 2011 11:01:25 -0700 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: Message-ID: <4E00DC75.5080604@noaa.gov> Neal Becker wrote: > I'm wondering what are good choices for fast numpy array serialization? > > mmap: fast, but I guess not self-describing? > hdf5: ? Should be pretty fast, and self describing -- advantage of being a standard. Disadvantage is that it requires an hdf5 library, which can b a pain to install on some systems. > pickle: self-describing, but maybe not fast? > others? there is .tofile() and .fromfile() -- should be about as fast as you can get, not self-describing. Then there is .save(), savez() and .load() (.npz format)? It should be pretty fast, and self-describing (but not a standard outside of numpy). I doubt pickle will ever be your best bet. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Tue Jun 21 13:57:25 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 21 Jun 2011 12:57:25 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > > On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe wrote: > >> NumPy has a mechanism built in to allow subclasses to adjust or override >> aspects of the ufunc behavior. While this goal is important, this mechanism >> only allows for very limited customization, making for instance the masked >> arrays unable to work with the native ufuncs in a full and proper way. 
I >> would like to deprecate the current mechanism, in particular >> __array_prepare__ and __array_wrap__, and introduce a new method I will >> describe below. If you've ever used these mechanisms, please review this >> design to see if it meets your needs. >> >> >> > The current approach is at a dead end, so something better needs to be > done. > > >> Any class type which would like to override its behavior in ufuncs would >> define a method called _numpy_ufunc_, and optionally an attribute >> __array_priority__ as can already be done. The class which wins the priority >> battle gets its _numpy_ufunc_ function called as follows: >> >> return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) >> >> >> To support this overloading, the ufunc would get a new support method, >> result_type, and there would be a new global function, broadcast_empty_like. >> >> The function ufunc.empty_like behaves like the global np.result_type, but >> produces the output type or a tuple of output types specific to the ufunc, >> which may follow a different convention than regular arithmetic type >> promotion. This allows for a class to create an output array of the correct >> type to pass to the ufunc if it needs to be different than the default. >> >> The function broadcast_empty_like is just like empty_like, but takes a >> list or tuple of arrays which are to be broadcast together for producing the >> output, instead of just one. >> >> > How does the ufunc get called so it doesn't get caught in an endless loop? > I like the proposed method if it can also be used for classes that don't > subclass ndarray. Masked array, for instance, should probably not subclass > ndarray. > The function being called needs to ensure this, either by extracting a raw ndarray from instances of its class, or adding a 'subok = False' parameter to the kwargs. Supporting objects that aren't ndarray subclasses is one of the purposes for this approach, and neither of my two example cases subclassed ndarray. -Mark > > >> Thanks, >> Mark >> >> >> A simple class which overrides the ufuncs might look as follows: >> >> def sin(ufunc, *args, **kwargs): >> # Convert degrees to radians >> args[0] = np.deg2rad(args[0]) >> # Return a regular array, since the result is not in degrees >> return ufunc(*args, **kwargs) >> >> class MyDegreesClass: >> """Array-like object with a degrees unit""" >> >> def __init__(arr): >> self.arr = arr >> >> def _numpy_ufunc_(ufunc, *args, **kwargs): >> override = globals().get(ufunc.name) >> if override: >> return override(ufunc, *args, **kwargs) >> else: >> raise TypeError, 'ufunc %s incompatible with MyDegreesClass' % >> ufunc.name >> >> >> >> A more complex example will be something like this: >> >> def my_general_ufunc(ufunc, *args, **kwargs): >> # Extract the 'out' argument. This only supports ufuncs with >> # one output, currently. 
>> out = kwargs.get('out') >> if len(args) > ufunc.nin: >> if out is None: >> out = args[ufunc.nin] >> else: >> raise ValueError, "'out' given as both a position and keyword >> argument" >> >> # Just want the inputs from here on >> args = args[:ufunc.nin] >> >> # Strip out MyArrayClass, but allow operations with regular ndarrays >> raw_in = [] >> for a in args: >> if isinstance(a, MyArrayClass): >> raw_in.append(a.arr) >> else: >> raw_in.append(a) >> >> # Allocate the output array >> if not out is None: >> if isinstance(out, MyArrayClass): >> raise TypeError, "'out' must have type MyArrayClass" >> else: >> # Create the output array, obeying the 'order' parameter, >> # but disallowing subclasses >> out = np.broadcast_empty_like([args, >> order=kwargs.get('order'), >> dtype=ufunc.result_type(args), >> subok=False) >> >> # Override the output argument >> kwargs['out'] = out.arr >> >> # Call the ufunc >> ufunc(*args, **kwargs) >> >> # Return the output >> return out >> >> class MyArrayClass: >> def __init__(arr): >> self.arr = arr >> >> def _numpy_ufunc_(ufunc, *args, **kwargs): >> override = globals().get(ufunc.name) >> if override: >> return override(ufunc, *args, **kwargs) >> else: >> return my_general_ufunc(ufunc, *args, **kwargs) >> >> > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abie at uw.edu Tue Jun 21 14:07:12 2011 From: abie at uw.edu (A. Flaxman) Date: Tue, 21 Jun 2011 18:07:12 +0000 Subject: [Numpy-discussion] what python module to handle csv? In-Reply-To: References: Message-ID: I am a huge fan of rec2csv and csv2rec, which might not technically be part of numpy, and can be found in pylab or the matplotlib.mlab module. --Abie From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Olivier Delalleau Sent: Wednesday, June 15, 2011 9:26 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] what python module to handle csv? Using savetxt with delimiter=',' should do the trick. If you want a more advanced csv interface to e.g. save more than a numpy array into a single csv, you can probably look into the python csv module. -=- Olivier 2011/6/15 Chao YUE > Dear all pythoners, what do you use python module to handle csv file (for reading we can use numpy.genfromtxt)? is there anyone we can do with csv file very convinient as that in R? can numpy.genfromtxt be used as writing? (I didn't try this yet because on our server we have only numpy 1.0.1...). This really make me struggling since csv is a very important interface file (I think so...). Thanks a lot, Sincerely, Chao -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Chris.Barker at noaa.gov Tue Jun 21 14:22:19 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Tue, 21 Jun 2011 11:22:19 -0700 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: Message-ID: <4E00E15B.709@noaa.gov> Neal Becker wrote: >> I'm wondering what are good choices for fast numpy array serialization? >> >> mmap: fast, but I guess not self-describing? >> hdf5: ? >> pickle: self-describing, but maybe not fast? >> others? > > I think, in addition, that hdf5 is the only one that easily interoperates with > matlab? netcdf is another option if you want an open standard supported by Matlab and the like. I like the netCDF4 package: http://code.google.com/p/netcdf4-python/ Though is can be kind of slow to write (I have no idea why!) plain old tofile() should be readable by other tools as well, as long as you have a way to specify what is in the file. > speaking of hdf5, I see: > > pyhdf5io 0.7 - Python module containing high-level hdf5 load and save > functions. > h5py 2.0.0 - Read and write HDF5 files from Python There is also pytables, which uses HDF5 under the hood. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From derek at astro.physik.uni-goettingen.de Tue Jun 21 14:24:08 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Tue, 21 Jun 2011 20:24:08 +0200 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: Message-ID: <0D1A51B9-8659-4E6D-AC1D-DC820D988990@astro.physik.uni-goettingen.de> On 21.06.2011, at 7:58PM, Neal Becker wrote: > I think, in addition, that hdf5 is the only one that easily interoperates with > matlab? > > speaking of hdf5, I see: > > pyhdf5io 0.7 - Python module containing high-level hdf5 load and save > functions. > h5py 2.0.0 - Read and write HDF5 files from Python > > Any thoughts on the relative merits of these? In my experience, HDF5 access usually approaches disk access speed, and random access to sub-datasets should be significantly faster than reading in the entire file, though I have not been able to test this. I have not heard about pyhdf5io (how does it work together with numpy?) - as alternative to h5py I'd rather recommend pytables, though I prefer the former for its cleaner/simpler interface (but that probably depends on your programming habits). HTH, Derek From charlesr.harris at gmail.com Tue Jun 21 14:28:11 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jun 2011 12:28:11 -0600 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 11:57 AM, Mark Wiebe wrote: > On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe wrote: >> >>> NumPy has a mechanism built in to allow subclasses to adjust or override >>> aspects of the ufunc behavior. While this goal is important, this mechanism >>> only allows for very limited customization, making for instance the masked >>> arrays unable to work with the native ufuncs in a full and proper way. I >>> would like to deprecate the current mechanism, in particular >>> __array_prepare__ and __array_wrap__, and introduce a new method I will >>> describe below. If you've ever used these mechanisms, please review this >>> design to see if it meets your needs. 
>>> >>> >>> >> The current approach is at a dead end, so something better needs to be >> done. >> >> >>> Any class type which would like to override its behavior in ufuncs would >>> define a method called _numpy_ufunc_, and optionally an attribute >>> __array_priority__ as can already be done. The class which wins the priority >>> battle gets its _numpy_ufunc_ function called as follows: >>> >>> return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) >>> >>> >>> To support this overloading, the ufunc would get a new support method, >>> result_type, and there would be a new global function, broadcast_empty_like. >>> >>> The function ufunc.empty_like behaves like the global np.result_type, but >>> produces the output type or a tuple of output types specific to the ufunc, >>> which may follow a different convention than regular arithmetic type >>> promotion. This allows for a class to create an output array of the correct >>> type to pass to the ufunc if it needs to be different than the default. >>> >>> The function broadcast_empty_like is just like empty_like, but takes a >>> list or tuple of arrays which are to be broadcast together for producing the >>> output, instead of just one. >>> >>> >> How does the ufunc get called so it doesn't get caught in an endless loop? >> I like the proposed method if it can also be used for classes that don't >> subclass ndarray. Masked array, for instance, should probably not subclass >> ndarray. >> > > The function being called needs to ensure this, either by extracting a raw > ndarray from instances of its class, or adding a 'subok = False' parameter > to the kwargs. Supporting objects that aren't ndarray subclasses is one of > the purposes for this approach, and neither of my two example cases > subclassed ndarray. > > Sounds good. Many of the current uses of __array_wrap__ that I am aware of are in the wrappers in the linalg module and don't go through the ufunc machinery. How would that be handled? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From silyko at gmail.com Tue Jun 21 14:32:50 2011 From: silyko at gmail.com (Simon Lyngby Kokkendorff) Date: Tue, 21 Jun 2011 20:32:50 +0200 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: <0D1A51B9-8659-4E6D-AC1D-DC820D988990@astro.physik.uni-goettingen.de> References: <0D1A51B9-8659-4E6D-AC1D-DC820D988990@astro.physik.uni-goettingen.de> Message-ID: Hi, I have been using h5py a lot (both on windows and Mac OSX) and can only recommend it- haven't tried the other options though.... Cheers, Simon On Tue, Jun 21, 2011 at 8:24 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > On 21.06.2011, at 7:58PM, Neal Becker wrote: > > > I think, in addition, that hdf5 is the only one that easily interoperates > with > > matlab? > > > > speaking of hdf5, I see: > > > > pyhdf5io 0.7 - Python module containing high-level hdf5 load and save > > functions. > > h5py 2.0.0 - Read and write HDF5 files from Python > > > > Any thoughts on the relative merits of these? > > In my experience, HDF5 access usually approaches disk access speed, and > random access to sub-datasets should be significantly faster than reading in > the entire file, though I have not been able to test this. > I have not heard about pyhdf5io (how does it work together with numpy?) - > as alternative to h5py I'd rather recommend pytables, though I prefer the > former for its cleaner/simpler interface (but that probably depends on your > programming habits). 
> > HTH, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Barker at noaa.gov Tue Jun 21 14:35:01 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Tue, 21 Jun 2011 11:35:01 -0700 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: Message-ID: <4E00E455.2070604@noaa.gov> Robert Kern wrote: > https://raw.github.com/numpy/numpy/master/doc/neps/npy-format.txt Just a note. From that doc: """ HDF5 is a complicated format that more or less implements a hierarchical filesystem-in-a-file. This fact makes satisfying some of the Requirements difficult. To the author's knowledge, as of this writing, there is no application or library that reads or writes even a subset of HDF5 files that does not use the canonical libhdf5 implementation. """ I'm pretty sure that the NetcdfJava libs, developed by Unidata, use their own home-grown code. netcdf4 is built on HDF5, so that qualifies as "a library that reads or writes a subset of HDF5 files". Perhaps there are lessons to be learned there. (too bad it's Java) """ Furthermore, by providing the first non-libhdf5 implementation of HDF5, we would be able to encourage more adoption of simple HDF5 in applications where it was previously infeasible because of the size of the library. """ I suppose this point is still true -- a C lib that supported a subset of hdf would be nice. That being said, I like the simplicity of the .npy format, and I don't know that anyone wants to take any of this on anyway. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From dsdale24 at gmail.com Tue Jun 21 14:46:15 2011 From: dsdale24 at gmail.com (Darren Dale) Date: Tue, 21 Jun 2011 14:46:15 -0400 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 2:28 PM, Charles R Harris wrote: > > > On Tue, Jun 21, 2011 at 11:57 AM, Mark Wiebe wrote: >> >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris >> wrote: >>> >>> >>> On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe wrote: >>>> >>>> NumPy has a mechanism built in to allow subclasses to adjust or override >>>> aspects of the ufunc behavior. While this goal is important, this mechanism >>>> only allows for very limited customization, making for instance the masked >>>> arrays unable to work with the native ufuncs in a full and proper way. I >>>> would like to deprecate the current mechanism, in particular >>>> __array_prepare__ and __array_wrap__, and introduce a new method I will >>>> describe below. If you've ever used these mechanisms, please review this >>>> design to see if it meets your needs. >>>> >>> >>> The current approach is at a dead end, so something better needs to be >>> done. >>> >>>> >>>> Any class type which would like to override its behavior in ufuncs would >>>> define a method called _numpy_ufunc_, and optionally an attribute >>>> __array_priority__ as can already be done. 
The class which wins the priority >>>> battle gets its _numpy_ufunc_ function called as follows: >>>> >>>> return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) >>>> >>>> To support this overloading, the ufunc would get a new support method, >>>> result_type, and there would be a new global function, broadcast_empty_like. >>>> The function ufunc.empty_like behaves like the global np.result_type, >>>> but produces the output type or a tuple of output types specific to the >>>> ufunc, which may follow a different convention than regular arithmetic type >>>> promotion. This allows for a class to create an output array of the correct >>>> type to pass to the ufunc if it needs to be different than the default. >>>> The function broadcast_empty_like is just like empty_like, but takes a >>>> list or tuple of arrays which are to be broadcast together for producing the >>>> output, instead of just one. >>> >>> How does the ufunc get called so it doesn't get caught in an endless >>> loop? I like the proposed method if it can also be used for classes that >>> don't subclass ndarray. Masked array, for instance, should probably not >>> subclass ndarray. >> >> The function being called needs to ensure this, either by extracting a raw >> ndarray from instances of its class, or adding a 'subok = False' parameter >> to the kwargs. Supporting objects that aren't ndarray subclasses is one of >> the purposes for this approach, and neither of my two example cases >> subclassed ndarray. > > Sounds good. Many of the current uses of __array_wrap__ that I am aware of > are in the wrappers in the linalg module and don't go through the ufunc > machinery. How would that be handled? I contributed the __array_prepare__ method a while back so classes could raise errors before the array data is modified in place. Specifically, I was concerned about units support in my quantities package (http://pypi.python.org/pypi/quantities). But I agree that this approach is needs to be reconsidered. It would be nice for subclasses to have an opportunity to intercept and process the values passed to a ufunc on their way in. For example, it would be nice if when I did np.cos(1.5 degrees), my subclass could intercept the value and pass a new one on to the ufunc machinery that is expressed in radians. I thought PJ Eby's generic functions PEP would be a really good way to handle ufuncs, but the PEP has stagnated. Darren From charlesr.harris at gmail.com Tue Jun 21 15:02:09 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jun 2011 13:02:09 -0600 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 12:46 PM, Darren Dale wrote: > On Tue, Jun 21, 2011 at 2:28 PM, Charles R Harris > wrote: > > > > > > On Tue, Jun 21, 2011 at 11:57 AM, Mark Wiebe wrote: > >> > >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris > >> wrote: > >>> > >>> > >>> On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe > wrote: > >>>> > >>>> NumPy has a mechanism built in to allow subclasses to adjust or > override > >>>> aspects of the ufunc behavior. While this goal is important, this > mechanism > >>>> only allows for very limited customization, making for instance the > masked > >>>> arrays unable to work with the native ufuncs in a full and proper way. > I > >>>> would like to deprecate the current mechanism, in particular > >>>> __array_prepare__ and __array_wrap__, and introduce a new method I > will > >>>> describe below. 
If you've ever used these mechanisms, please review > this > >>>> design to see if it meets your needs. > >>>> > >>> > >>> The current approach is at a dead end, so something better needs to be > >>> done. > >>> > >>>> > >>>> Any class type which would like to override its behavior in ufuncs > would > >>>> define a method called _numpy_ufunc_, and optionally an attribute > >>>> __array_priority__ as can already be done. The class which wins the > priority > >>>> battle gets its _numpy_ufunc_ function called as follows: > >>>> > >>>> return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) > >>>> > >>>> To support this overloading, the ufunc would get a new support method, > >>>> result_type, and there would be a new global function, > broadcast_empty_like. > >>>> The function ufunc.empty_like behaves like the global np.result_type, > >>>> but produces the output type or a tuple of output types specific to > the > >>>> ufunc, which may follow a different convention than regular arithmetic > type > >>>> promotion. This allows for a class to create an output array of the > correct > >>>> type to pass to the ufunc if it needs to be different than the > default. > >>>> The function broadcast_empty_like is just like empty_like, but takes a > >>>> list or tuple of arrays which are to be broadcast together for > producing the > >>>> output, instead of just one. > >>> > >>> How does the ufunc get called so it doesn't get caught in an endless > >>> loop? I like the proposed method if it can also be used for classes > that > >>> don't subclass ndarray. Masked array, for instance, should probably not > >>> subclass ndarray. > >> > >> The function being called needs to ensure this, either by extracting a > raw > >> ndarray from instances of its class, or adding a 'subok = False' > parameter > >> to the kwargs. Supporting objects that aren't ndarray subclasses is one > of > >> the purposes for this approach, and neither of my two example cases > >> subclassed ndarray. > > > > Sounds good. Many of the current uses of __array_wrap__ that I am aware > of > > are in the wrappers in the linalg module and don't go through the ufunc > > machinery. How would that be handled? > > I contributed the __array_prepare__ method a while back so classes > could raise errors before the array data is modified in place. > Specifically, I was concerned about units support in my quantities > package (http://pypi.python.org/pypi/quantities). But I agree that > this approach is needs to be reconsidered. It would be nice for > subclasses to have an opportunity to intercept and process the values > passed to a ufunc on their way in. For example, it would be nice if > when I did np.cos(1.5 degrees), my subclass could intercept the value > and pass a new one on to the ufunc machinery that is expressed in > radians. I thought PJ Eby's generic functions PEP would be a really > Link to PEP-3124 . Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From srean.list at gmail.com Tue Jun 21 15:35:17 2011 From: srean.list at gmail.com (srean) Date: Tue, 21 Jun 2011 14:35:17 -0500 Subject: [Numpy-discussion] (cumsum, broadcast) in (numexpr, weave) Message-ID: Hi All, is there a fast way to do cumsum with numexpr ? I could not find it, but the functions available in numexpr does not seem to be exhaustively documented, so it is possible that I missed it. Do not know if 'sum' takes special arguments that can be used. 
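(For concreteness, the operation is just the usual prefix-sum recurrence -- a minimal
plain-numpy sketch of it, nothing numexpr-specific:

import numpy as np

a = np.arange(5.0)
for i in range(1, len(a)):
    a[i] += a[i - 1]     # each step reads the result of the previous step
# a is now [0., 1., 3., 6., 10.], i.e. np.cumsum(np.arange(5.0))

The strict left-to-right dependence is what makes this hard to express with purely
element-wise operations.)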
To try another track, does numexpr operators have something like the 'out' parameter for ufuncs ? If it is so, one could perhaps use add( a[0:-1], a[1,:], out = a[1,:) provided it is possible to preserve the sequential semantics. Another option is to use weave which does have cumsum. However my code requires expressions which implement broadcast. That leads to my next question, does repeat or concat return a copy or a view. If they avoid copying, I could perhaps use repeat to simulate efficient broadcasting. Or will it make a copy of that array anyway ?. I would ideally like to use numexpr because I make heavy use of transcendental functions and was hoping to exploit the VML library. Thanks for the help -- srean From srean.list at gmail.com Tue Jun 21 15:50:38 2011 From: srean.list at gmail.com (srean) Date: Tue, 21 Jun 2011 14:50:38 -0500 Subject: [Numpy-discussion] (cumsum, broadcast) in (numexpr, weave) In-Reply-To: References: Message-ID: Apologies, intended to send this to the scipy list. On Tue, Jun 21, 2011 at 2:35 PM, srean wrote: > Hi All, > > ?is there a fast way to do cumsum with numexpr ? I could not find it, > but the functions available in numexpr does not seem to be > exhaustively documented, so it is possible that I missed it. Do not > know if 'sum' takes special arguments that can be used. > > To try another track, does numexpr operators have something like the > 'out' parameter for ufuncs ? If it is so, one could perhaps use > add( a[0:-1], a[1,:], out = a[1,:) provided it is possible to preserve > the sequential semantics. > > Another option is to use weave which does have cumsum. However my code > requires ?expressions which implement broadcast. That leads to my next > question, does repeat or concat return a copy or a view. If they avoid > copying, I could perhaps use repeat to simulate efficient > broadcasting. Or will it make a copy of that array anyway ?. I would > ideally like to use numexpr because I make heavy use of transcendental > functions and was hoping to exploit the VML library. > > Thanks for the help > > -- srean > From ralf.gommers at googlemail.com Tue Jun 21 16:05:26 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Tue, 21 Jun 2011 22:05:26 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: <4E00ACD7.5000504@gmail.com> References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> <4E00ACD7.5000504@gmail.com> Message-ID: On Tue, Jun 21, 2011 at 4:38 PM, Bruce Southey wrote: > ** > On 06/21/2011 01:01 AM, Ralf Gommers wrote: > > > > On Tue, Jun 21, 2011 at 3:55 AM, Bruce Southey wrote: > >> On Mon, Jun 20, 2011 at 2:43 PM, Ralf Gommers >> wrote: >> > >> > >> > On Mon, Jun 20, 2011 at 8:50 PM, Bruce Southey >> wrote: >> >> >> >> I copied the files but that just moves the problem. So that patch is >> >> incorrect. >> >> >> >> I get the same errors on Fedora 15 supplied Python3.2 for numpy 1.6.0 >> and >> >> using git from 'https://github.com/rgommers/numpy.git'. Numpy is >> getting >> >> Fedora supplied Atlas (1.5.1 does not). >> >> >> >> It appears that there is a misunderstanding of the PEP because 'SO' and >> >> 'SOABI' do exactly what the PEP says on my systems: >> > >> > It doesn't on OS X. But that's not even the issue. 
As I explained >> before, >> > the issue is that get_config_var('SO') is used to determine the >> extension of >> > system libraries (such as liblapack.so) and python-related ones (such as >> > multiarray.cpython-32m.so). And the current functions don't do >> mindreading. >> >> >> >> >>> from distutils import sysconfig sysconfig.get_config_var('SO') >> >> '.cpython-32m.so' >> >> >>> sysconfig.get_config_var('SOABI') >> >> 'cpython-32m' >> >> >> >> Consequently, the name, 'multiarray.pyd', created within numpy is >> invalid. >> > >> > I removed the line in ctypeslib that was trying this, so I think you are >> not >> > testing my patch. >> > >> > Ralf >> > >> >> >> >> Looking the code, I see this line which makes no sense given that the >> >> second part is true under Linux: >> >> >> >> if (not is_python_ext) and 'SOABI' in >> >> distutils.sysconfig.get_config_vars(): >> >> >> >> So I think the 'get_shared_lib_extension' function is wrong and >> probably >> >> unneeded. >> >> >> >> >> >> Bruce >> >> >> >> Just to show that this is the new version, I added two print >> statements in the 'get_shared_lib_extension' function: >> >>> from numpy.distutils.misc_util import get_shared_lib_extension >> >>> get_shared_lib_extension(True) >> first so_ext .cpython-32mu.so >> returned so_ext .cpython-32mu.so >> '.cpython-32mu.so' >> >>> get_shared_lib_extension(False) >> first so_ext .cpython-32mu.so >> returned so_ext .so >> '.so' >> > > This all looks correct. Before you were saying you were still getting > 'multiarray.pyd', now your error says 'multiarray.so'. So now you are > testing the right thing. Test test_basic2() in test_ctypeslib was fixed, but > I forgot to fix it in two other places. I updated both my branches on > github, please try again. > >> >> The reason for the same location is obvious because all the patch does >> is move the code to get the extension into that function. So the >> 'get_shared_lib_extension' function returns the extension '.so' to the >> load_library function. However that name is wrong under Linux as it >> has to be 'multiarray.cpython-32mu.so' and hence the error in the same >> location. I did come across this thread >> 'http://bugs.python.org/issue10262' which indicates why Linux is >> different by default. >> >> So what is the actual name of the multiarray shared library with the Mac? >> If it is ' 'multiarray.so' then the correct name is "libname + >> sysconfig.get_config_var('SO')" as I previously indicated. >> >> It is, and yes that's correct. Orthogonal to the actual issue though. > > Ralf > > > > > While the test now pass, you have now changed an API for load_library. > Only in a backwards-compatible way, which should be fine. I added a keyword, the default of which does the same as before. The only thing I did other than that was remove tries with clearly invalid extensions, like ".pyd" on Linux. Now that I'm writing that though, I think it's better to try both ".so" and ".cpython-32mu.so" by default for python >=3.2. > This is not something that is meant to occur in a bug-fix release as well > as the new argument is undocumented. But I do not understand the need for > this extra complexity when "libname + sysconfig.get_config_var('SO')" works > on Linux, Windows and Mac. > > I've tried to explain this twice already. You have both " multiarray.cpython-32mu.so" and "liblapack.so" (or some other library like that) on your system. The extension of both is needed, and always obtained via get_config_var('SO'). See the problem? 
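For concreteness, a minimal sketch of the try-both-suffixes fallback (illustrative only --
not the actual numpy.distutils code; the helper name is made up and the suffix values are
just what a CPython 3.2 Linux build typically reports):

import os
import ctypes
from distutils import sysconfig

def load_with_fallback(libname, loader_path):
    # Extension modules built by Python >= 3.2 carry the ABI tag
    # (e.g. 'multiarray.cpython-32mu.so'), while system libraries keep
    # the plain platform suffix (e.g. 'liblapack.so'), so try both.
    suffixes = [sysconfig.get_config_var('SO'), '.so']
    for suffix in suffixes:
        candidate = os.path.join(loader_path, libname + suffix)
        if os.path.exists(candidate):
            return ctypes.cdll.LoadLibrary(candidate)
    raise OSError('%s not found with extensions %r' % (libname, suffixes))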
If someone knows a better way to do this, I'm all for it. But I don't see a simpler way. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at googlemail.com Tue Jun 21 16:52:21 2011 From: ralf.gommers at googlemail.com (Ralf Gommers) Date: Tue, 21 Jun 2011 22:52:21 +0200 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> <4E00ACD7.5000504@gmail.com> Message-ID: On Tue, Jun 21, 2011 at 10:05 PM, Ralf Gommers wrote: > > > On Tue, Jun 21, 2011 at 4:38 PM, Bruce Southey wrote: > >> ** >> On 06/21/2011 01:01 AM, Ralf Gommers wrote: >> >> >> >> On Tue, Jun 21, 2011 at 3:55 AM, Bruce Southey wrote: >> >> >>> So what is the actual name of the multiarray shared library with the Mac? >>> If it is ' 'multiarray.so' then the correct name is "libname + >>> sysconfig.get_config_var('SO')" as I previously indicated. >>> >>> It is, and yes that's correct. Orthogonal to the actual issue though. >> >> Ralf >> >> >> While the test now pass, you have now changed an API for load_library. >> > > Only in a backwards-compatible way, which should be fine. I added a > keyword, the default of which does the same as before. The only thing I did > other than that was remove tries with clearly invalid extensions, like > ".pyd" on Linux. Now that I'm writing that though, I think it's better to > try both ".so" and ".cpython-32mu.so" by default for python >=3.2. > This should try both extensions: https://github.com/rgommers/numpy/tree/sharedlibext Does that look better? Ralf > > >> This is not something that is meant to occur in a bug-fix release as well >> as the new argument is undocumented. But I do not understand the need for >> this extra complexity when "libname + sysconfig.get_config_var('SO')" works >> on Linux, Windows and Mac. >> >> I've tried to explain this twice already. You have both " > multiarray.cpython-32mu.so" and "liblapack.so" (or some other library like > that) on your system. The extension of both is needed, and always obtained > via get_config_var('SO'). See the problem? > > If someone knows a better way to do this, I'm all for it. But I don't see a > simpler way. > > Cheers, > Ralf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 21 17:43:13 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 21 Jun 2011 16:43:13 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 1:28 PM, Charles R Harris wrote: > > > On Tue, Jun 21, 2011 at 11:57 AM, Mark Wiebe wrote: > >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe wrote: >>> >>>> NumPy has a mechanism built in to allow subclasses to adjust or override >>>> aspects of the ufunc behavior. While this goal is important, this mechanism >>>> only allows for very limited customization, making for instance the masked >>>> arrays unable to work with the native ufuncs in a full and proper way. I >>>> would like to deprecate the current mechanism, in particular >>>> __array_prepare__ and __array_wrap__, and introduce a new method I will >>>> describe below. 
If you've ever used these mechanisms, please review this >>>> design to see if it meets your needs. >>>> >>>> >>>> >>> The current approach is at a dead end, so something better needs to be >>> done. >>> >>> >>>> Any class type which would like to override its behavior in ufuncs would >>>> define a method called _numpy_ufunc_, and optionally an attribute >>>> __array_priority__ as can already be done. The class which wins the priority >>>> battle gets its _numpy_ufunc_ function called as follows: >>>> >>>> return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) >>>> >>>> >>>> To support this overloading, the ufunc would get a new support method, >>>> result_type, and there would be a new global function, broadcast_empty_like. >>>> >>>> The function ufunc.empty_like behaves like the global np.result_type, >>>> but produces the output type or a tuple of output types specific to the >>>> ufunc, which may follow a different convention than regular arithmetic type >>>> promotion. This allows for a class to create an output array of the correct >>>> type to pass to the ufunc if it needs to be different than the default. >>>> >>>> The function broadcast_empty_like is just like empty_like, but takes a >>>> list or tuple of arrays which are to be broadcast together for producing the >>>> output, instead of just one. >>>> >>>> >>> How does the ufunc get called so it doesn't get caught in an endless >>> loop? I like the proposed method if it can also be used for classes that >>> don't subclass ndarray. Masked array, for instance, should probably not >>> subclass ndarray. >>> >> >> The function being called needs to ensure this, either by extracting a raw >> ndarray from instances of its class, or adding a 'subok = False' parameter >> to the kwargs. Supporting objects that aren't ndarray subclasses is one of >> the purposes for this approach, and neither of my two example cases >> subclassed ndarray. >> >> > Sounds good. Many of the current uses of __array_wrap__ that I am aware of > are in the wrappers in the linalg module and don't go through the ufunc > machinery. How would that be handled? > Those could stay as they are, and just the ufunc usage of __array_wrap__ can be deprecated. For classes which currently use __array_wrap__, they would just need to also implement _numpy_ufunc_ to eliminate any deprecation messages. -Mark > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 21 17:45:12 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 21 Jun 2011 16:45:12 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 1:46 PM, Darren Dale wrote: > On Tue, Jun 21, 2011 at 2:28 PM, Charles R Harris > wrote: > > > > > > On Tue, Jun 21, 2011 at 11:57 AM, Mark Wiebe wrote: > >> > >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris > >> wrote: > >>> > >>> > >>> On Mon, Jun 20, 2011 at 12:32 PM, Mark Wiebe > wrote: > >>>> > >>>> NumPy has a mechanism built in to allow subclasses to adjust or > override > >>>> aspects of the ufunc behavior. While this goal is important, this > mechanism > >>>> only allows for very limited customization, making for instance the > masked > >>>> arrays unable to work with the native ufuncs in a full and proper way. 
> I > >>>> would like to deprecate the current mechanism, in particular > >>>> __array_prepare__ and __array_wrap__, and introduce a new method I > will > >>>> describe below. If you've ever used these mechanisms, please review > this > >>>> design to see if it meets your needs. > >>>> > >>> > >>> The current approach is at a dead end, so something better needs to be > >>> done. > >>> > >>>> > >>>> Any class type which would like to override its behavior in ufuncs > would > >>>> define a method called _numpy_ufunc_, and optionally an attribute > >>>> __array_priority__ as can already be done. The class which wins the > priority > >>>> battle gets its _numpy_ufunc_ function called as follows: > >>>> > >>>> return arr._numpy_ufunc_(current_ufunc, *args, **kwargs) > >>>> > >>>> To support this overloading, the ufunc would get a new support method, > >>>> result_type, and there would be a new global function, > broadcast_empty_like. > >>>> The function ufunc.empty_like behaves like the global np.result_type, > >>>> but produces the output type or a tuple of output types specific to > the > >>>> ufunc, which may follow a different convention than regular arithmetic > type > >>>> promotion. This allows for a class to create an output array of the > correct > >>>> type to pass to the ufunc if it needs to be different than the > default. > >>>> The function broadcast_empty_like is just like empty_like, but takes a > >>>> list or tuple of arrays which are to be broadcast together for > producing the > >>>> output, instead of just one. > >>> > >>> How does the ufunc get called so it doesn't get caught in an endless > >>> loop? I like the proposed method if it can also be used for classes > that > >>> don't subclass ndarray. Masked array, for instance, should probably not > >>> subclass ndarray. > >> > >> The function being called needs to ensure this, either by extracting a > raw > >> ndarray from instances of its class, or adding a 'subok = False' > parameter > >> to the kwargs. Supporting objects that aren't ndarray subclasses is one > of > >> the purposes for this approach, and neither of my two example cases > >> subclassed ndarray. > > > > Sounds good. Many of the current uses of __array_wrap__ that I am aware > of > > are in the wrappers in the linalg module and don't go through the ufunc > > machinery. How would that be handled? > > I contributed the __array_prepare__ method a while back so classes > could raise errors before the array data is modified in place. > Specifically, I was concerned about units support in my quantities > package (http://pypi.python.org/pypi/quantities). But I agree that > this approach is needs to be reconsidered. It would be nice for > subclasses to have an opportunity to intercept and process the values > passed to a ufunc on their way in. For example, it would be nice if > when I did np.cos(1.5 degrees), my subclass could intercept the value > and pass a new one on to the ufunc machinery that is expressed in > radians. I thought PJ Eby's generic functions PEP would be a really > good way to handle ufuncs, but the PEP has stagnated. > I made one of my examples overriding sin with degrees, because I think this overloading method can work well for a physical quantities library. -Mark > > Darren > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From alex.flint at gmail.com Tue Jun 21 20:09:17 2011 From: alex.flint at gmail.com (Alex Flint) Date: Tue, 21 Jun 2011 20:09:17 -0400 Subject: [Numpy-discussion] fast SSD Message-ID: Is there a fast way to compute an array of sum-of-squared-differences between a (small) K x K array and all K x K sub-arrays of a larger array? (i.e. each element x,y in the output array is the SSD between the small array and the sub-array (x:x+K, y:y+K) My current implementation loops over each sub-array and computes the SSD with something like ((A-B)**2).sum(). Cheers, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwgoodman at gmail.com Tue Jun 21 20:25:39 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Tue, 21 Jun 2011 17:25:39 -0700 Subject: [Numpy-discussion] fast SSD In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 5:09 PM, Alex Flint wrote: > Is there a fast way to compute an array of sum-of-squared-differences > between a (small) ?K x K array and all K x K sub-arrays of a larger array? > (i.e. each element x,y in the output array is the SSD between the small > array and the sub-array (x:x+K, y:y+K) > My current implementation loops over each sub-array and computes the SSD > with something like ((A-B)**2).sum(). I don't know of a clever way. But if ((A-B)**2).sum() is a sizable fraction of the time, then you could use bottleneck: >> a = np.random.rand(5,5) >> timeit (a**2).sum() 100000 loops, best of 3: 4.63 us per loop >> import bottleneck as bn >> timeit bn.ss(a) 1000000 loops, best of 3: 1.77 us per loop >> func, b = bn.func.ss_selector(a, axis=None) >> func >> timeit func(b) 1000000 loops, best of 3: 830 ns per loop From warren.weckesser at enthought.com Tue Jun 21 20:46:21 2011 From: warren.weckesser at enthought.com (Warren Weckesser) Date: Tue, 21 Jun 2011 19:46:21 -0500 Subject: [Numpy-discussion] fast SSD In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 7:09 PM, Alex Flint wrote: > Is there a fast way to compute an array of sum-of-squared-differences > between a (small) K x K array and all K x K sub-arrays of a larger array? > (i.e. each element x,y in the output array is the SSD between the small > array and the sub-array (x:x+K, y:y+K) > > My current implementation loops over each sub-array and computes the SSD > with something like ((A-B)**2).sum(). > You can use stride tricks and broadcasting: ----- import numpy as np from numpy.lib.stride_tricks import as_strided a = np.arange(24).reshape(4,6) k = np.array([[1,2,0],[2,1,0],[0,1,1]]) # If `a` has shape (4,6), then `b` will have shape (2, 4, 3, 3). # b[i,j] will be the 2-D sub-array of `a` with shape (3, 3). b = as_strided(a, shape=(a.shape[0]-k.shape[0]+1, a.shape[1]-k.shape[1]+1) + k.shape, strides=a.strides * 2) ssd = ((b - k)**2).sum(-1).sum(-1) print a print k print ssd ----- It's a neat trick, but be aware that the temporary result b - k will be nine times the size of a. If a is large, this might be unacceptable. Warren > Cheers, > Alex > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ischnell at enthought.com Tue Jun 21 21:15:18 2011 From: ischnell at enthought.com (Ilan Schnell) Date: Tue, 21 Jun 2011 20:15:18 -0500 Subject: [Numpy-discussion] ANN: ETS 4.0 released Message-ID: Hello, I am happy to announce the release of ETS 4.0. 
This is the first major release of the Enthought Tool Suite in almost three years. This release removes the 'enthought' namespace from all projects. For example: from enthought.traits.api import HasTraits is now simply:: from traits.api import HasTraits For backwards compatibility, a proxy package 'etsproxy' has been added, which should permit existing code to work. For convenience this package also contains a refactor tool 'ets3to4' to convert projects to the new namespace (so that they don't rely on the 'etsproxy' package). If you want to download the source code of all ETS projects, you can download http://www.enthought.com/repo/ets/ALLETS-4.0.0.tar (41MB). The projects themselves are now hosted on: https://github.com/enthought We understand that the namespace refactor (which prompted this major release in the first place) is a big change, and even though we have tested examples and some of our own code against this ETS version, we expect there to be little glitches. We are therefore already planning a 4.0.1 bug-fix release in about 2-3 weeks. We are looking forward to your feedback (the development mailing list is enthought-dev at enthought.com), and hope you enjoy ETS 4. - Ilan From bsouthey at gmail.com Tue Jun 21 22:31:30 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Jun 2011 21:31:30 -0500 Subject: [Numpy-discussion] ANN: Numpy 1.6.1 release candidate 1 In-Reply-To: References: <7C5D3F17-A538-4F33-B8DD-776AE125BC90@astro.physik.uni-goettingen.de> <9FFE6E5F-2E74-4AAD-8AA4-9E80DDCDDD77@astro.physik.uni-goettingen.de> <4DF635F2.8030406@gmail.com> <4DFF967B.7090407@gmail.com> <4E00ACD7.5000504@gmail.com> Message-ID: On Tue, Jun 21, 2011 at 3:52 PM, Ralf Gommers wrote: > > > On Tue, Jun 21, 2011 at 10:05 PM, Ralf Gommers > wrote: >> >> >> On Tue, Jun 21, 2011 at 4:38 PM, Bruce Southey wrote: >>> >>> On 06/21/2011 01:01 AM, Ralf Gommers wrote: >>> >>> On Tue, Jun 21, 2011 at 3:55 AM, Bruce Southey >>> wrote: >>> >>>> >>>> So what is the actual name of the multiarray shared library with the >>>> Mac? >>>> If it is ' 'multiarray.so' then the correct name is "libname + >>>> sysconfig.get_config_var('SO')" as I previously indicated. >>>> >>> It is, and yes that's correct. Orthogonal to the actual issue though. >>> >>> Ralf >>> >>> >>> While the test now pass, you have now changed an API for load_library. >> >> Only in a backwards-compatible way, which should be fine. I added a >> keyword, the default of which does the same as before. The only thing I did >> other than that was remove tries with clearly invalid extensions, like >> ".pyd" on Linux. Now that I'm writing that though, I think it's better to >> try both ".so" and ".cpython-32mu.so" by default for python >=3.2. > > This should try both extensions: > https://github.com/rgommers/numpy/tree/sharedlibext > Does that look better? > > Ralf > >> >> >>> >>> This is not something that is meant to occur in a bug-fix release as well >>> as the new argument is undocumented. But I do not understand the need for >>> this extra complexity when "libname + sysconfig.get_config_var('SO')" works >>> on Linux, Windows and Mac. >>> >> I've tried to explain this twice already. You have both >> "multiarray.cpython-32mu.so" and "liblapack.so" (or some other library like >> that) on your system. The extension of both is needed, and always obtained >> via get_config_var('SO'). See the problem? >> >> If someone knows a better way to do this, I'm all for it. But I don't see >> a simpler way. 
>> >> Cheers, >> Ralf >> Okay, I see what you are getting at: There are various Python bindings/interface to C libraries (Fortran in Scipy) such as lapack and atlas created by a non-Python source but there is also an 'shared library' created with Python as part of an extension. The PEP does not appear to differentiate between these - probably because it assumes non-Python libraries are loaded differently (ctypes is Python 2.5+). So there has to be a different way to find the extension. Bruce From chaoyuejoy at gmail.com Wed Jun 22 03:15:21 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Wed, 22 Jun 2011 09:15:21 +0200 Subject: [Numpy-discussion] what python module to handle csv? In-Reply-To: References: Message-ID: Thank you Olivier and Abie, I'll try that~~ Best wishes, Chao 2011/6/21 A. Flaxman > I am a huge fan of rec2csv and csv2rec, which might not technically be > part of numpy, and can be found in pylab or the matplotlib.mlab module.*** > * > > ** ** > > --Abie**** > > ** ** > > ** ** > > *From:* numpy-discussion-bounces at scipy.org [mailto: > numpy-discussion-bounces at scipy.org] *On Behalf Of *Olivier Delalleau > *Sent:* Wednesday, June 15, 2011 9:26 AM > *To:* Discussion of Numerical Python > *Subject:* Re: [Numpy-discussion] what python module to handle csv?**** > > ** ** > > Using savetxt with delimiter=',' should do the trick. > > If you want a more advanced csv interface to e.g. save more than a numpy > array into a single csv, you can probably look into the python csv module. > > -=- Olivier**** > > 2011/6/15 Chao YUE **** > > Dear all pythoners, > > what do you use python module to handle csv file (for reading we can use > numpy.genfromtxt)? is there anyone we can do with csv file very convinient > as that in R? > can numpy.genfromtxt be used as writing? (I didn't try this yet because on > our server we have only numpy 1.0.1...). This really make me struggling > since csv is a very > important interface file (I think so...). > > Thanks a lot, > > Sincerely, > > Chao > > -- **** > > > *********************************************************************************** > **** > > Chao YUE > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > UMR 1572 CEA-CNRS-UVSQ > Batiment 712 - Pe 119 > 91191 GIF Sur YVETTE Cedex**** > > Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16**** > > > ************************************************************************************ > **** > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion**** > > ** ** > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From pav at iki.fi Wed Jun 22 06:08:49 2011 From: pav at iki.fi (Pauli Virtanen) Date: Wed, 22 Jun 2011 10:08:49 +0000 (UTC) Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs References: Message-ID: Tue, 21 Jun 2011 16:43:13 -0500, Mark Wiebe wrote: [clip: __array_wrap__] > Those could stay as they are, and just the ufunc usage of __array_wrap__ > can be deprecated. For classes which currently use __array_wrap__, they > would just need to also implement _numpy_ufunc_ to eliminate any > deprecation messages. Do you mean that the new mechanism would not be able to do the same thing here? Preservation of array subclasses in linalg functions is not very uniform, and will likely need fixes in several of the functions. Since new code in any case would need to be written, I'd prefer using the "new" approach and so leaving us the option of marking the "old" approach deprecated. Pauli From dsdale24 at gmail.com Wed Jun 22 07:49:51 2011 From: dsdale24 at gmail.com (Darren Dale) Date: Wed, 22 Jun 2011 07:49:51 -0400 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 21, 2011 at 1:57 PM, Mark Wiebe wrote: > On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris > wrote: >> How does the ufunc get called so it doesn't get caught in an endless loop? [...] > The function being called needs to ensure this, either by extracting a raw > ndarray from instances of its class, or adding a 'subok = False' parameter > to the kwargs. I didn't understand, could you please expand on that or show an example? Darren From xscript at gmx.net Wed Jun 22 08:34:26 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Wed, 22 Jun 2011 14:34:26 +0200 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: (Darren Dale's message of "Wed, 22 Jun 2011 07:49:51 -0400") References: Message-ID: <87hb7i3uzx.fsf@ginnungagap.bsc.es> Darren Dale writes: > On Tue, Jun 21, 2011 at 1:57 PM, Mark Wiebe wrote: >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris >> wrote: >>> How does the ufunc get called so it doesn't get caught in an endless loop? > [...] >> The function being called needs to ensure this, either by extracting a raw >> ndarray from instances of its class, or adding a 'subok = False' parameter >> to the kwargs. > I didn't understand, could you please expand on that or show an example? As I understood the initial description and examples, the ufunc overload will keep being used as long as its arguments are of classes that declare ufunc overrides (i.e., classes with the "_numpy_ufunc_" attribute). Thus Mark's comment saying that you have to either transform the arguments into raw ndarrays (either by creating new ones or passing a view) or use the "subok = False" kwarg parameter to break a possible overloading loop. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From mwwiebe at gmail.com Wed Jun 22 09:59:08 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 08:59:08 -0500 Subject: [Numpy-discussion] datetime pull request for review and build/test In-Reply-To: References: Message-ID: This has been reviewed, I'm proceeding with a merge. 
-Mark On Mon, Jun 20, 2011 at 7:38 PM, Mark Wiebe wrote: > https://github.com/numpy/numpy/pull/93 > > The summary: > > * Tighten up date unit vs time unit casting rules, and integrate the > NPY_CASTING enum deeper into the datetime conversions > * Determine a unit when converting from a string array, similar to when > converting from lists of strings > * Switch local/utc handling to a timezone= parameter, which also accepts a > datetime.tzinfo object. This, for example, enables the use of the pytz > library with numpy.datetime64 > * Remove the events metadata, make the old API functions raise exceptions, > and rename the "frequency" metadata name to "timeunit" > * Abstract the flexible dtype mechanism into a function, so that it can be > more easily changed without having to add code to many places > > -Mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 22 10:24:28 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 09:24:28 -0500 Subject: [Numpy-discussion] review/test pull request for tightening default ufunc casting rule Message-ID: Pull request is here: https://github.com/numpy/numpy/pull/95 -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex.flint at gmail.com Wed Jun 22 11:30:47 2011 From: alex.flint at gmail.com (Alex Flint) Date: Wed, 22 Jun 2011 11:30:47 -0400 Subject: [Numpy-discussion] argmax for top N elements Message-ID: Is it possible to use argmax or something similar to find the locations of the largest N elements in a matrix? -------------- next part -------------- An HTML attachment was scrubbed... URL: From e.antero.tammi at gmail.com Wed Jun 22 11:44:28 2011 From: e.antero.tammi at gmail.com (eat) Date: Wed, 22 Jun 2011 18:44:28 +0300 Subject: [Numpy-discussion] argmax for top N elements In-Reply-To: References: Message-ID: Hi, On Wed, Jun 22, 2011 at 6:30 PM, Alex Flint wrote: > Is it possible to use argmax or something similar to find the locations of > the largest N elements in a matrix? How bout argsort(.)?, Like: In []: a= arange(9) In []: a.argsort()[::-1][:3] Out[]: array([8, 7, 6]) My 2 cents, eat > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 22 13:21:59 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 12:21:59 -0500 Subject: [Numpy-discussion] struct dtype cleanup for testing and review Message-ID: This set of patches cleans up struct dtypes quite a bit and generalizes them to allow overlapping and out-of-order fields with non-object dtypes. The pull request is here: https://github.com/numpy/numpy/pull/94 The commit with documentation changes is here: https://github.com/m-paradox/numpy/commit/b23337999abcb3ecfa648d86f0bf049ef7e58d3e This fixes ticket (http://projects.scipy.org/numpy/ticket/1790), and some of the problems discussed in ticket ( http://projects.scipy.org/numpy/ticket/1619). -Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Wed Jun 22 13:25:53 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 12:25:53 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 5:08 AM, Pauli Virtanen wrote: > Tue, 21 Jun 2011 16:43:13 -0500, Mark Wiebe wrote: > [clip: __array_wrap__] > > Those could stay as they are, and just the ufunc usage of __array_wrap__ > > can be deprecated. For classes which currently use __array_wrap__, they > > would just need to also implement _numpy_ufunc_ to eliminate any > > deprecation messages. > > Do you mean that the new mechanism would not be able to do > the same thing here? > It would have to be generalized a bit more to support these usages, because some functions produce outputs with different shapes, and the inputs may not be broadcast together in the same manner as in the element-wise ufuncs. > Preservation of array subclasses in linalg functions is not very > uniform, and will likely need fixes in several of the functions. > Since new code in any case would need to be written, I'd prefer > using the "new" approach and so leaving us the option of marking > the "old" approach deprecated. > I think creating a @ufunc_overload(...) decorator for specifying the ufunc properties like nin and nout might be a nice way to generalize it. -Mark > > Pauli > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 22 13:31:05 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 12:31:05 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: <87hb7i3uzx.fsf@ginnungagap.bsc.es> References: <87hb7i3uzx.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 22, 2011 at 7:34 AM, Llu?s wrote: > Darren Dale writes: > > > On Tue, Jun 21, 2011 at 1:57 PM, Mark Wiebe wrote: > >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris > >> wrote: > >>> How does the ufunc get called so it doesn't get caught in an endless > loop? > > > [...] > > >> The function being called needs to ensure this, either by extracting a > raw > >> ndarray from instances of its class, or adding a 'subok = False' > parameter > >> to the kwargs. > > > I didn't understand, could you please expand on that or show an example? > > As I understood the initial description and examples, the ufunc overload > will keep being used as long as its arguments are of classes that > declare ufunc overrides (i.e., classes with the "_numpy_ufunc_" > attribute). > > Thus Mark's comment saying that you have to either transform the > arguments into raw ndarrays (either by creating new ones or passing a > view) or use the "subok = False" kwarg parameter to break a possible > overloading loop. > The sequence of events is something like this: 1. You call np.sin(x) 2. The np.sin ufunc looks at x, sees the _numpy_ufunc_ attribute, and calls x._numpy_ufunc_(np.sin, x) 3. _numpy_ufunc_ uses np.sin.name (which is "sin") to get the correct my_sin function to call 4A. If my_sin called np.sin(x), we would go back to 1. and get an infinite loop 4B. If x is a subclass of ndarray, my_sin can call np.sin(x, subok=False), as this disables the subclass overloading mechanism. 4C. 
If x is not a subclass of ndarray, x needs to produce an ndarray, for instance it might have an x.arr property. Then it can call np.sin(x.arr) -Mark > > > Lluis > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 22 13:49:02 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 22 Jun 2011 11:49:02 -0600 Subject: [Numpy-discussion] struct dtype cleanup for testing and review In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 11:21 AM, Mark Wiebe wrote: > This set of patches cleans up struct dtypes quite a bit and generalizes > them to allow overlapping and out-of-order fields with non-object dtypes. > The pull request is here: > > Overlapping, as in unions? > https://github.com/numpy/numpy/pull/94 > > The commit with documentation changes is here: > > > https://github.com/m-paradox/numpy/commit/b23337999abcb3ecfa648d86f0bf049ef7e58d3e > > This fixes ticket (http://projects.scipy.org/numpy/ticket/1790), and some > of the problems discussed in ticket ( > http://projects.scipy.org/numpy/ticket/1619). > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 22 14:07:31 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 13:07:31 -0500 Subject: [Numpy-discussion] struct dtype cleanup for testing and review In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 12:49 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > > On Wed, Jun 22, 2011 at 11:21 AM, Mark Wiebe wrote: > >> This set of patches cleans up struct dtypes quite a bit and generalizes >> them to allow overlapping and out-of-order fields with non-object dtypes. >> The pull request is here: >> >> > Overlapping, as in unions? > That's correct. -Mark > > >> https://github.com/numpy/numpy/pull/94 >> >> The commit with documentation changes is here: >> >> >> https://github.com/m-paradox/numpy/commit/b23337999abcb3ecfa648d86f0bf049ef7e58d3e >> >> This fixes ticket (http://projects.scipy.org/numpy/ticket/1790), and some >> of the problems discussed in ticket ( >> http://projects.scipy.org/numpy/ticket/1619). >> >> > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Wed Jun 22 14:33:58 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Wed, 22 Jun 2011 14:33:58 -0400 Subject: [Numpy-discussion] write access through iterator Message-ID: Maybe I'm being dense today, but I don't see how to iterate over arrays with write access. You could read through iterators like: fl = u.flat >>> for item in fl: ... print item but you can't do for item in fl: item = 10 (or, it won't do what you want). Is there any way to do this? 
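A minimal sketch of the behaviour being asked about, assuming a small 2-d array named u: rebinding the loop variable does nothing to the array, while indexing into the flat iterator (or assigning through it) does write back.

import numpy as np

u = np.arange(6).reshape(2, 3)

for item in u.flat:
    item = 10          # only rebinds the local name; u is unchanged

fl = u.flat
for i in range(u.size):
    fl[i] = 10         # writes through to u

u.flat[:] = 10         # or assign through the iterator in one step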
From mwwiebe at gmail.com Wed Jun 22 14:39:48 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 13:39:48 -0500 Subject: [Numpy-discussion] write access through iterator In-Reply-To: References: Message-ID: You can do something pretty close to that with the 1.6 iterator: >>> u = np.arange(20).reshape(5,4)[:,:3] >>> u array([[ 0, 1, 2], [ 4, 5, 6], [ 8, 9, 10], [12, 13, 14], [16, 17, 18]]) >>> for item in np.nditer(u, [], ['readwrite'], order='C'): ... item[...] = 10 ... >>> u array([[10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10]]) -Mark On Wed, Jun 22, 2011 at 1:33 PM, Neal Becker wrote: > Maybe I'm being dense today, but I don't see how to iterate over arrays > with > write access. You could read through iterators like: > > fl = u.flat > >>> for item in fl: > ... print item > > but you can't do > > for item in fl: > item = 10 > > (or, it won't do what you want). > > Is there any way to do this? > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RadimRehurek at seznam.cz Wed Jun 22 15:08:25 2011 From: RadimRehurek at seznam.cz (RadimRehurek) Date: Wed, 22 Jun 2011 21:08:25 +0200 (CEST) Subject: [Numpy-discussion] argmax for top N elements In-Reply-To: Message-ID: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> > Date: Wed, 22 Jun 2011 11:30:47 -0400 > From: Alex Flint > Subject: [Numpy-discussion] argmax for top N elements > > Is it possible to use argmax or something similar to find the locations of > the largest N elements in a matrix? I would also be interested in an O(N) argmax/argmin for indices of top k values in an array. I'm currently using argsort[:k] in a performance sensitive part and it's not ideal. Best, Radim From olivier.grisel at ensta.org Wed Jun 22 15:10:34 2011 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 22 Jun 2011 21:10:34 +0200 Subject: [Numpy-discussion] argmax for top N elements In-Reply-To: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> References: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> Message-ID: 2011/6/22 RadimRehurek : >> Date: Wed, 22 Jun 2011 11:30:47 -0400 >> From: Alex Flint >> Subject: [Numpy-discussion] argmax for top N elements >> >> Is it possible to use argmax or something similar to find the locations of >> the largest N elements in a matrix? > > I would also be interested in an O(N) argmax/argmin for indices of top k values in an array. I'm currently using argsort[:k] in a performance sensitive part and it's not ideal. +1 -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel From ben.root at ou.edu Wed Jun 22 15:11:31 2011 From: ben.root at ou.edu (Benjamin Root) Date: Wed, 22 Jun 2011 14:11:31 -0500 Subject: [Numpy-discussion] argmax for top N elements In-Reply-To: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> References: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> Message-ID: On Wed, Jun 22, 2011 at 2:08 PM, RadimRehurek wrote: > > Date: Wed, 22 Jun 2011 11:30:47 -0400 > > From: Alex Flint > > Subject: [Numpy-discussion] argmax for top N elements > > > > Is it possible to use argmax or something similar to find the locations > of > > the largest N elements in a matrix? > > I would also be interested in an O(N) argmax/argmin for indices of top k > values in an array. 
I'm currently using argsort[:k] in a performance > sensitive part and it's not ideal. > > Best, > Radim > I think there were some discussions recently for new features to add to the next release of Numpy, and I think this was one of them. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwgoodman at gmail.com Wed Jun 22 15:18:22 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Wed, 22 Jun 2011 12:18:22 -0700 Subject: [Numpy-discussion] argmax for top N elements In-Reply-To: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> References: <27019.7617.18283-25286-1677204361-1308769705@seznam.cz> Message-ID: On Wed, Jun 22, 2011 at 12:08 PM, RadimRehurek wrote: >> Date: Wed, 22 Jun 2011 11:30:47 -0400 >> From: Alex Flint >> Subject: [Numpy-discussion] argmax for top N elements >> >> Is it possible to use argmax or something similar to find the locations of >> the largest N elements in a matrix? > > I would also be interested in an O(N) argmax/argmin for indices of top k values in an array. I'm currently using argsort[:k] in a performance sensitive part and it's not ideal. > You can try argpartsort from the bottleneck package: >> a = np.random.rand(100000) >> k = 10 >> timeit a.argsort()[:k] 100 loops, best of 3: 11.2 ms per loop >> import bottleneck as bn >> timeit bn.argpartsort(a, k)[:k] 1000 loops, best of 3: 1.29 ms per loop Output of argpartsort is not ordered: >> a.argsort()[:k] array([97239, 21091, 25130, 41638, 62323, 57419, 4381, 34905, 94572, 25935]) >> bn.argpartsort(a, k)[:k] array([21091, 62323, 25130, 97239, 41638, 57419, 4381, 34905, 94572, 25935]) From dsdale24 at gmail.com Wed Jun 22 17:57:37 2011 From: dsdale24 at gmail.com (Darren Dale) Date: Wed, 22 Jun 2011 17:57:37 -0400 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: <87hb7i3uzx.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 22, 2011 at 1:31 PM, Mark Wiebe wrote: > On Wed, Jun 22, 2011 at 7:34 AM, Llu?s wrote: >> >> Darren Dale writes: >> >> > On Tue, Jun 21, 2011 at 1:57 PM, Mark Wiebe wrote: >> >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris >> >> wrote: >> >>> How does the ufunc get called so it doesn't get caught in an endless >> >>> loop? >> >> > [...] >> >> >> The function being called needs to ensure this, either by extracting a >> >> raw >> >> ndarray from instances of its class, or adding a 'subok = False' >> >> parameter >> >> to the kwargs. >> >> > I didn't understand, could you please expand on that or show an example? >> >> As I understood the initial description and examples, the ufunc overload >> will keep being used as long as its arguments are of classes that >> declare ufunc overrides (i.e., classes with the "_numpy_ufunc_" >> attribute). >> >> Thus Mark's comment saying that you have to either transform the >> arguments into raw ndarrays (either by creating new ones or passing a >> view) or use the "subok = False" kwarg parameter to break a possible >> overloading loop. > > The sequence of events is something like this: > 1. You call np.sin(x) > 2. The np.sin ufunc looks at x, sees the _numpy_ufunc_ attribute, and calls > x._numpy_ufunc_(np.sin, x) > 3. _numpy_ufunc_ uses np.sin.name (which is "sin") to get the correct my_sin > function to call > 4A. If my_sin called np.sin(x), we would go back to 1. and get an infinite > loop > 4B. If x is a subclass of ndarray, my_sin can call np.sin(x, subok=False), > as this disables the subclass overloading mechanism. > 4C. 
If x is not a subclass of ndarray, x needs to produce an ndarray, for > instance it might have an x.arr property. Then it can call np.sin(x.arr) Ok, that seems straightforward and, for what its worth, it looks like it would meet my needs. However, I wonder if the _numpy_func_ mechanism is the best approach. This is a bit sketchy, but why not do something like: class Proxy: def __init__(self, ufunc, *args): self._ufunc = ufunc self._args = args def __call__(self, func): self._ufunc._registry[tuple(type(arg) for arg in self._args)] = func return func class UfuncObject: ... def __call__(self, *args, **kwargs): func = self._registry.get(tuple(type(arg) for arg in args), None) if func is None: raise TypeError return func(*args, **kwargs) def register(self, *args): return Proxy(self, *args) @np.sin.register(Quantity) def sin(pq): if pq.units != degrees: pq = pq.rescale(degrees) temp = np.sin(pq.view(np.ndarray)) return Quantity(temp, copy=False) This way, classes don't have to implement special methods to support ufunc registration, special attributes to claim primacy in ufunc registration lookup, special versions of the functions for each numpy ufunc, *and* the logic to determine whether the combination of arguments is supported. By that I mean, if I call np.sum with a quantity and a masked array, and Quantity wins the __array_priority__ competition, then I also need to check that my special sum function(s) know how to operate on that combination of inputs. With the decorator approach, I just need to implement the special versions of the ufuncs, and the decorators handle the logic of knowing what combinations of arguments are supported. It might be worth considering using ABCs for registration and have UfuncObject use isinstance to determine the appropriate special function to call. Darren From mwwiebe at gmail.com Wed Jun 22 18:56:29 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 22 Jun 2011 17:56:29 -0500 Subject: [Numpy-discussion] replacing the mechanism for dispatching ufuncs In-Reply-To: References: <87hb7i3uzx.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 22, 2011 at 4:57 PM, Darren Dale wrote: > On Wed, Jun 22, 2011 at 1:31 PM, Mark Wiebe wrote: > > On Wed, Jun 22, 2011 at 7:34 AM, Llu?s wrote: > >> > >> Darren Dale writes: > >> > >> > On Tue, Jun 21, 2011 at 1:57 PM, Mark Wiebe > wrote: > >> >> On Tue, Jun 21, 2011 at 12:36 PM, Charles R Harris > >> >> wrote: > >> >>> How does the ufunc get called so it doesn't get caught in an endless > >> >>> loop? > >> > >> > [...] > >> > >> >> The function being called needs to ensure this, either by extracting > a > >> >> raw > >> >> ndarray from instances of its class, or adding a 'subok = False' > >> >> parameter > >> >> to the kwargs. > >> > >> > I didn't understand, could you please expand on that or show an > example? > >> > >> As I understood the initial description and examples, the ufunc overload > >> will keep being used as long as its arguments are of classes that > >> declare ufunc overrides (i.e., classes with the "_numpy_ufunc_" > >> attribute). > >> > >> Thus Mark's comment saying that you have to either transform the > >> arguments into raw ndarrays (either by creating new ones or passing a > >> view) or use the "subok = False" kwarg parameter to break a possible > >> overloading loop. > > > > The sequence of events is something like this: > > 1. You call np.sin(x) > > 2. The np.sin ufunc looks at x, sees the _numpy_ufunc_ attribute, and > calls > > x._numpy_ufunc_(np.sin, x) > > 3. 
_numpy_ufunc_ uses np.sin.name (which is "sin") to get the correct > my_sin > > function to call > > 4A. If my_sin called np.sin(x), we would go back to 1. and get an > infinite > > loop > > 4B. If x is a subclass of ndarray, my_sin can call np.sin(x, > subok=False), > > as this disables the subclass overloading mechanism. > > 4C. If x is not a subclass of ndarray, x needs to produce an ndarray, for > > instance it might have an x.arr property. Then it can call np.sin(x.arr) > > Ok, that seems straightforward and, for what its worth, it looks like > it would meet my needs. However, I wonder if the _numpy_func_ > mechanism is the best approach. This is a bit sketchy, but why not do > something like: > > class Proxy: > > def __init__(self, ufunc, *args): > self._ufunc = ufunc > self._args = args > > def __call__(self, func): > self._ufunc._registry[tuple(type(arg) for arg in self._args)] = func > return func > > > class UfuncObject: > > ... > > def __call__(self, *args, **kwargs): > func = self._registry.get(tuple(type(arg) for arg in args), None) > if func is None: > raise TypeError > return func(*args, **kwargs) > > def register(self, *args): > return Proxy(self, *args) > > > @np.sin.register(Quantity) > def sin(pq): > if pq.units != degrees: > pq = pq.rescale(degrees) > temp = np.sin(pq.view(np.ndarray)) > return Quantity(temp, copy=False) > > This way, classes don't have to implement special methods to support > ufunc registration, special attributes to claim primacy in ufunc > registration lookup, special versions of the functions for each numpy > ufunc, *and* the logic to determine whether the combination of > arguments is supported. By that I mean, if I call np.sum with a > quantity and a masked array, and Quantity wins the __array_priority__ > competition, then I also need to check that my special sum function(s) > know how to operate on that combination of inputs. With the decorator > approach, I just need to implement the special versions of the ufuncs, > and the decorators handle the logic of knowing what combinations of > arguments are supported. > > It might be worth considering using ABCs for registration and have > UfuncObject use isinstance to determine the appropriate special > function to call. > The thing I'm not sure about with this idea is how to do something general to modify all ufuncs for a particular class in one swoop. Having to individually override everything would be a hassle, I think. I also don't think this approach saves very much effort compared to the _numpy_ufunc_ call approach, checking the types and raising NotImplemented if they aren't what's wanted is pretty much the only difference, and that check could still be handled by a decorator like you're showing. -Mark > > Darren > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 23 00:49:21 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 22 Jun 2011 22:49:21 -0600 Subject: [Numpy-discussion] Python plugins for gcc static analysis Message-ID: Thought gcc-python-pluginmight be of interest to some. One of the motivations seems to have been checking reference handling in cpython. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
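Coming back to the ufunc-override thread above, the "check the types and return NotImplemented" step could be packaged roughly as below; every name in this sketch (expects, Quantity.arr, my_sin) is hypothetical and only meant to make the idea concrete:

import numpy as np

def expects(*types):
    # Hypothetical decorator: give up with NotImplemented unless the
    # positional arguments match the declared types.
    def decorator(func):
        def wrapper(*args, **kwargs):
            if len(args) < len(types) or \
               not all(isinstance(a, t) for a, t in zip(args, types)):
                return NotImplemented
            return func(*args, **kwargs)
        return wrapper
    return decorator

class Quantity(object):
    def __init__(self, value, units):
        self.arr = np.asarray(value)   # plain ndarray holding the data
        self.units = units

@expects(Quantity)
def my_sin(q):
    # Work on the raw ndarray so the override machinery is not re-entered.
    return Quantity(np.sin(q.arr), 'dimensionless')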
URL: From ndbecker2 at gmail.com Thu Jun 23 08:02:24 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 23 Jun 2011 08:02:24 -0400 Subject: [Numpy-discussion] A bit of advice? Message-ID: I have a set of experiments that I want to plot. There will be many plots. Each will show different test conditions. Suppose I put each of the test conditions and results into a recarray. The recarray could be: arr = np.empty ((#experiments,), dtype=[('x0',int), ('x1',int), ('y0',int)] where x0,x1 are 2 test conditions, and y0 is a result. First I want to group the plots such according to the test conditions. So, I want to first find what all the combinations of test conditions are. Dunno if there is anything simpler than: cases = tuple (set ((e['x0'], e['x1'])) for e in arr) Next, need to select all those experiments which match each of these cases. Now I know of no easy way. Any suggestions? Perhaps I should look again at pytables? From shish at keba.be Thu Jun 23 08:10:46 2011 From: shish at keba.be (Olivier Delalleau) Date: Thu, 23 Jun 2011 08:10:46 -0400 Subject: [Numpy-discussion] A bit of advice? In-Reply-To: References: Message-ID: What about : dict((k, [e for e in arr if (e['x0'], e['x1']) == k]) for k in cases) ? (note: it is inefficient written this way though) -=- Olivier 2011/6/23 Neal Becker > I have a set of experiments that I want to plot. There will be many plots. > Each will show different test conditions. > > Suppose I put each of the test conditions and results into a recarray. The > recarray could be: > > arr = np.empty ((#experiments,), dtype=[('x0',int), ('x1',int), ('y0',int)] > > where x0,x1 are 2 test conditions, and y0 is a result. > > First I want to group the plots such according to the test conditions. So, > I > want to first find what all the combinations of test conditions are. > > Dunno if there is anything simpler than: > > cases = tuple (set ((e['x0'], e['x1'])) for e in arr) > > Next, need to select all those experiments which match each of these cases. > Now > I know of no easy way. > > Any suggestions? Perhaps I should look again at pytables? > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Jun 23 08:20:32 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 23 Jun 2011 08:20:32 -0400 Subject: [Numpy-discussion] A bit of advice? References: Message-ID: Olivier Delalleau wrote: > What about : > dict((k, [e for e in arr if (e['x0'], e['x1']) == k]) for k in cases) > ? Not bad! Thanks! BTW, is there an easier way to get the unique keys, then this: cases = tuple (set (tuple((e['a'],e['b'])) for e in u)) > > (note: it is inefficient written this way though) > > -=- Olivier > > 2011/6/23 Neal Becker > >> I have a set of experiments that I want to plot. There will be many plots. >> Each will show different test conditions. >> >> Suppose I put each of the test conditions and results into a recarray. The >> recarray could be: >> >> arr = np.empty ((#experiments,), dtype=[('x0',int), ('x1',int), ('y0',int)] >> >> where x0,x1 are 2 test conditions, and y0 is a result. >> >> First I want to group the plots such according to the test conditions. So, >> I >> want to first find what all the combinations of test conditions are. 
>> >> Dunno if there is anything simpler than: >> >> cases = tuple (set ((e['x0'], e['x1'])) for e in arr) >> >> Next, need to select all those experiments which match each of these cases. >> Now >> I know of no easy way. >> >> Any suggestions? Perhaps I should look again at pytables? >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> From josef.pktd at gmail.com Thu Jun 23 09:03:46 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jun 2011 09:03:46 -0400 Subject: [Numpy-discussion] A bit of advice? In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 8:20 AM, Neal Becker wrote: > Olivier Delalleau wrote: > >> What about : >> dict((k, [e for e in arr if (e['x0'], e['x1']) == k]) for k in cases) >> ? > > Not bad! Thanks! > > BTW, is there an easier way to get the unique keys than this: > > cases = tuple (set (tuple((e['a'],e['b'])) for e in u)) I think you can just combine these 2: from collections import defaultdict experiments = defaultdict(list) for i, e in enumerate(arr): experiments[tuple((e['a'],e['b']))].append(i) #experiments[tuple((e['a'],e['b']))].append(y['c']) #or just summarize results experiments.keys() #uniques (just typed not checked) Josef > >> >> (note: it is inefficient written this way though) >> >> -=- Olivier >> >> 2011/6/23 Neal Becker >> >>> I have a set of experiments that I want to plot. There will be many plots. >>> Each will show different test conditions. >>> >>> Suppose I put each of the test conditions and results into a recarray. The >>> recarray could be: >>> >>> arr = np.empty ((#experiments,), dtype=[('x0',int), ('x1',int), ('y0',int)] >>> >>> where x0,x1 are 2 test conditions, and y0 is a result. >>> >>> First I want to group the plots such according to the test conditions. So, >>> I >>> want to first find what all the combinations of test conditions are. >>> >>> Dunno if there is anything simpler than: >>> >>> cases = tuple (set ((e['x0'], e['x1'])) for e in arr) >>> >>> Next, need to select all those experiments which match each of these cases. >>> Now >>> I know of no easy way. >>> >>> Any suggestions? Perhaps I should look again at pytables? >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From mwwiebe at gmail.com Thu Jun 23 10:17:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 09:17:44 -0500 Subject: [Numpy-discussion] review/test pull request for tightening default ufunc casting rule In-Reply-To: References: Message-ID: This is now merged. -Mark On Wed, Jun 22, 2011 at 9:24 AM, Mark Wiebe wrote: > Pull request is here: > > https://github.com/numpy/numpy/pull/95 > > -Mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jhtu at Princeton.EDU Thu Jun 23 16:08:03 2011 From: jhtu at Princeton.EDU (Jonathan Tu) Date: Thu, 23 Jun 2011 16:08:03 -0400 Subject: [Numpy-discussion] Parallelize repeated operations Message-ID: Hi, I have found a very strange bug that I cannot understand. 
I would like to do something like this: Given 4 pairs of numpy arrays (x1, y1, x2, y2, x3, y3, x4, y4), I would like to compute each corresponding inner product ip = np.dot(yi.T, xj), for i = 1,...4 and j = 1,...,4. (Something like a correlation matrix.) What I did was use shared memory and the multiprocessing module (with pool) to load the data in parallel. Each processor loads one pair of the snapshots, so I can do 4 simultaneous loads on my four-core machine, and it takes two cycles of loads to get all the data in memory. Then I tried to do the inner products in parallel as well, asking each processors to do 4 of the 16 total inner products. As it turned out, this was slower in parallel than in serial!!! This only occurs when I use numpy functions. If instead I replace the inner product task by printing to stdout or something of that sort, I get the 4x speedup that I expect. Any ideas? Jonathan Tu From mwwiebe at gmail.com Thu Jun 23 16:53:31 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 15:53:31 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray Message-ID: Enthought has asked me to look into the "missing data" problem and how NumPy could treat it better. I've considered the different ideas of adding dtype variants with a special signal value and masked arrays, and concluded that adding masks to the core ndarray appears is the best way to deal with the problem in general. I've written a NEP that proposes a particular design, viewable here: https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst There are some questions at the bottom of the NEP which definitely need discussion to find the best design choices. Please read, and let me know of all the errors and gaps you find in the document. Thanks, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Thu Jun 23 17:19:19 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 23 Jun 2011 14:19:19 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: I'd like to see a statement of what the "missing data problem" is, and how this solves it? Because I don't think this is entirely intuitive, or that everyone necessarily has the same idea. > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate as if the values weren't there For context: My experience with missing data is in statistical analysis; I find R's NA support to be pretty awesome for those purposes. The conceptual model it's based on is that an NA value is some number that we just happen not to know. So from this perspective, I find it pretty confusing that adding an unknown quantity to 3 should result in 3, rather than another unknown quantity. (Obviously it should be possible to compute the sum of the known values, but IME it's important for the default behavior to be to fail loudly when things are wonky, not to silently patch them up, possibly incorrectly!) Also, what should 'dot' do with missing values? -- Nathaniel On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > Enthought has asked me to look into the "missing data" problem and how NumPy > could treat it better. I've considered the different ideas of adding dtype > variants with a special signal value and masked arrays, and concluded that > adding masks to the core ndarray appears is the best way to deal with the > problem in general. 
> I've written a NEP that proposes a particular design, viewable here: > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > There are some questions at the bottom of the NEP which definitely need > discussion to find the best design choices. Please read, and let me know of > all the errors and gaps you find in the document. > Thanks, > Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From mwwiebe at gmail.com Thu Jun 23 17:37:20 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 16:37:20 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith wrote: > I'd like to see a statement of what the "missing data problem" is, and > how this solves it? Because I don't think this is entirely intuitive, > or that everyone necessarily has the same idea. > I agree it represents different problems in different contexts. For NumPy, I think the mechanism for dealing with it needs to be intuitive to work with in a maximum number of contexts, avoiding surprises. Getting feedback from a broad range of people is the only way a general solution can be designed with any level of confidence. > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate as > if the values weren't there > > For context: My experience with missing data is in statistical > analysis; I find R's NA support to be pretty awesome for those > purposes. The conceptual model it's based on is that an NA value is > some number that we just happen not to know. So from this perspective, > I find it pretty confusing that adding an unknown quantity to 3 should > result in 3, rather than another unknown quantity. (Obviously it > should be possible to compute the sum of the known values, but IME > it's important for the default behavior to be to fail loudly when > things are wonky, not to silently patch them up, possibly > incorrectly!) > The conceptual model you describe sounds reasonable to me, and I definitely like the idea of consistently following one such model for all default behaviors. > Also, what should 'dot' do with missing values? > A matrix multiplication is defined in terms of sums of products, so it can be implemented to behave consistently with your conceptual model. > > -- Nathaniel > > On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > > Enthought has asked me to look into the "missing data" problem and how > NumPy > > could treat it better. I've considered the different ideas of adding > dtype > > variants with a special signal value and masked arrays, and concluded > that > > adding masks to the core ndarray appears is the best way to deal with the > > problem in general. > > I've written a NEP that proposes a particular design, viewable here: > > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > > There are some questions at the bottom of the NEP which definitely need > > discussion to find the best design choices. Please read, and let me know > of > > all the errors and gaps you find in the document. 
> > Thanks, > > Mark > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 23 17:44:10 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 23 Jun 2011 16:44:10 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: > Enthought has asked me to look into the "missing data" problem and how NumPy > could treat it better. I've considered the different ideas of adding dtype > variants with a special signal value and masked arrays, and concluded that > adding masks to the core ndarray appears is the best way to deal with the > problem in general. > I've written a NEP that proposes a particular design, viewable here: > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > There are some questions at the bottom of the NEP which definitely need > discussion to find the best design choices. Please read, and let me know of > all the errors and gaps you find in the document. One thing that could use more explanation is how your proposal improves on the status quo, i.e. numpy.ma. As far as I can see, you are mostly just shuffling around the functionality that already exists. There has been a continual desire for something like R's NA values by people who are very familiar with both R and numpy's masked arrays. Both have their uses, and as Nathaniel points out, R's approach seems to be very well-liked by a lot of users. In essence, *that's* the "missing data problem" that you were charged with: making happy the users who are currently dissatisfied with masked arrays. It doesn't seem to me that moving the functionality from numpy.ma to numpy.ndarray resolves any of their issues. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From charlesr.harris at gmail.com Thu Jun 23 17:46:30 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 15:46:30 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe wrote: > Enthought has asked me to look into the "missing data" problem and how > NumPy could treat it better. I've considered the different ideas of adding > dtype variants with a special signal value and masked arrays, and concluded > that adding masks to the core ndarray appears is the best way to deal with > the problem in general. > > I've written a NEP that proposes a particular design, viewable here: > > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > > There are some questions at the bottom of the NEP which definitely need > discussion to find the best design choices. Please read, and let me know of > all the errors and gaps you find in the document. > > I agree that low level support for masks is the way to go. 
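To make the two reduction semantics being debated in this thread concrete, a minimal sketch using nothing but a plain boolean validity mask (True meaning valid) and today's NumPy, independent of whatever API the NEP settles on; the variable names are invented:

import numpy as np

data  = np.array([1.0, 2.0, 3.0, 4.0])
valid = np.array([True, False, True, False])

data[valid].sum()                      # 4.0 -- "skip the missing values"

data[np.zeros(4, dtype=bool)].sum()    # 0.0 -- an empty selection gives the
                                       # additive identity, as quoted just below

result = data.sum() if valid.all() else None   # R-style: any missing value
                                               # makes the result unknown
                                               # (None stands in for NA here)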
> If all the input values are masked, 'sum' and 'prod' will produce the additive and multiplicative identities respectively A masked zero dimensional array might be another option, depending on how you handle scalars. This would also work when arrays were summed down an axis if a masked array was returned. I suppose the problem with using the word 'mask' is the implication that it hides something. Maybe 'window' would be an alternate choice, although in this context I tend to think of 'mask' as having the meaning you assign to it. Chuck > Thanks, > Mark > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Thu Jun 23 17:48:50 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 23 Jun 2011 23:48:50 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <20110623214850.GD8017@phare.normalesup.org> On Thu, Jun 23, 2011 at 03:53:31PM -0500, Mark Wiebe wrote: > concluded that adding masks to the core ndarray appears is the best way to > deal with the problem in general. It seems to me that this is going to make the numpy array a way more complex object. Althought it is currently quite simple, that object has already a hard time getting acceptance beyond the scientific community, whereas it should really be used in many other places. Right now, the numpy array can be seen as an extension of the C array, basically a pointer, a data type, and a shape (and strides). This enables easy sharing with libraries that have not been written with numpy in mind. The limitations of the subclassing approach that you mention do not seem fundemental to me. For instance the impossibility to mix subclasses could perhaps be solved using the Mixin Pattern. Ufuncs need work, but I have the impression that your proposal is simply to solve the special case of masked data in the ufunc by breaking the simple numpy array model. By moving in the core a growing amount of functionality, it seems to me that you are going to make it more and more complex while loosing its genericity. Each new feature will need to go in the core and induce a high cost. Making inheritance and unfuncs more generic seems to me like a better investment. My 2 cents, Gael PS: I am on the verge of conference travel, so I will not be able to participate any further in the discussion. From charlesr.harris at gmail.com Thu Jun 23 17:50:28 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 15:50:28 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 3:44 PM, Robert Kern wrote: > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: > > Enthought has asked me to look into the "missing data" problem and how > NumPy > > could treat it better. I've considered the different ideas of adding > dtype > > variants with a special signal value and masked arrays, and concluded > that > > adding masks to the core ndarray appears is the best way to deal with the > > problem in general. 
> > I've written a NEP that proposes a particular design, viewable here: > > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > > There are some questions at the bottom of the NEP which definitely need > > discussion to find the best design choices. Please read, and let me know > of > > all the errors and gaps you find in the document. > > One thing that could use more explanation is how your proposal > improves on the status quo, i.e. numpy.ma. As far as I can see, you > are mostly just shuffling around the functionality that already > exists. There has been a continual desire for something like R's NA > values by people who are very familiar with both R and numpy's masked > arrays. Both have their uses, and as Nathaniel points out, R's > approach seems to be very well-liked by a lot of users. In essence, > *that's* the "missing data problem" that you were charged with: making > happy the users who are currently dissatisfied with masked arrays. It > doesn't seem to me that moving the functionality from numpy.ma to > numpy.ndarray resolves any of their issues. > > So we are looking for unhappy users ;) Making the functionality, whatever it turns out to be, faster, more complete, and easier to use also seems a worthy goal. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 17:51:31 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 16:51:31 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 4:44 PM, Robert Kern wrote: > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: > > Enthought has asked me to look into the "missing data" problem and how > NumPy > > could treat it better. I've considered the different ideas of adding > dtype > > variants with a special signal value and masked arrays, and concluded > that > > adding masks to the core ndarray appears is the best way to deal with the > > problem in general. > > I've written a NEP that proposes a particular design, viewable here: > > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > > There are some questions at the bottom of the NEP which definitely need > > discussion to find the best design choices. Please read, and let me know > of > > all the errors and gaps you find in the document. > > One thing that could use more explanation is how your proposal > improves on the status quo, i.e. numpy.ma. As far as I can see, you > are mostly just shuffling around the functionality that already > exists. Please read my proposal in more detail. numpy.ma has many problems of inconsistency with numpy which by itself makes it difficult to work with, and while I read the ma documentation, that's not my starting point. > There has been a continual desire for something like R's NA > values by people who are very familiar with both R and numpy's masked > arrays. Both have their uses, and as Nathaniel points out, R's > approach seems to be very well-liked by a lot of users. In essence, > *that's* the "missing data problem" that you were charged with: making > happy the users who are currently dissatisfied with masked arrays. I'm not intimately familiar with R, so anyone who is and has time to provide detailed feedback will be much appreciated. > It doesn't seem to me that moving the functionality from numpy.ma to > numpy.ndarray resolves any of their issues. 
> The design also needs to be clarified and made properly consistent, after which I would be surprised it doesn't resolve the issues. -Mark > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless > enigma that is made terrible by our own mad attempt to interpret it as > though it had an underlying truth." > -- Umberto Eco > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From efiring at hawaii.edu Thu Jun 23 17:54:45 2011 From: efiring at hawaii.edu (Eric Firing) Date: Thu, 23 Jun 2011 11:54:45 -1000 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4E03B625.4000302@hawaii.edu> On 06/23/2011 11:19 AM, Nathaniel Smith wrote: > I'd like to see a statement of what the "missing data problem" is, and > how this solves it? Because I don't think this is entirely intuitive, > or that everyone necessarily has the same idea. > >> Reduction operations like 'sum', 'prod', 'min', and 'max' will operate as if the values weren't there > > For context: My experience with missing data is in statistical > analysis; I find R's NA support to be pretty awesome for those > purposes. The conceptual model it's based on is that an NA value is > some number that we just happen not to know. So from this perspective, > I find it pretty confusing that adding an unknown quantity to 3 should > result in 3, rather than another unknown quantity. (Obviously it > should be possible to compute the sum of the known values, but IME > it's important for the default behavior to be to fail loudly when > things are wonky, not to silently patch them up, possibly > incorrectly!) From the oceanographic data acquisition and analysis perspective, and perhaps from a more general plotting perspective (matplotlib, specifically) missing data is simply missing; we don't have it, we never will, but we need to do the best calculation (or plot) we can with what is left. For plotting, that generally means showing a gap in a line, a hole in a contour plot, etc. For calculations like basic statistics, it means doing the calculation, e.g. a mean, with the available numbers, *and* having an easy way to find out how many numbers were available. That's what the masked array count() method is for. Some types of calculations, like the FFT, simply can't be done by ignoring missing values, so one must first use some filling method, perhaps interpolation, for example, and then pass an unmasked array to the function. The present masked array module is very close to what is really needed for the sorts of things I am involved with. It looks to me like the main deficiencies are addressed by Mark's proposal, although the change in the definition of the mask might make for a painful transition. Eric > > Also, what should 'dot' do with missing values? > > -- Nathaniel > > On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: >> Enthought has asked me to look into the "missing data" problem and how NumPy >> could treat it better. I've considered the different ideas of adding dtype >> variants with a special signal value and masked arrays, and concluded that >> adding masks to the core ndarray appears is the best way to deal with the >> problem in general. 
>> I've written a NEP that proposes a particular design, viewable here: >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> There are some questions at the bottom of the NEP which definitely need >> discussion to find the best design choices. Please read, and let me know of >> all the errors and gaps you find in the document. >> Thanks, >> Mark >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From mwwiebe at gmail.com Thu Jun 23 17:55:00 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 16:55:00 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris wrote: > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe wrote: > >> Enthought has asked me to look into the "missing data" problem and how >> NumPy could treat it better. I've considered the different ideas of adding >> dtype variants with a special signal value and masked arrays, and concluded >> that adding masks to the core ndarray appears is the best way to deal with >> the problem in general. >> >> I've written a NEP that proposes a particular design, viewable here: >> >> >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> >> There are some questions at the bottom of the NEP which definitely need >> discussion to find the best design choices. Please read, and let me know of >> all the errors and gaps you find in the document. >> >> > I agree that low level support for masks is the way to go. > > > If all the input values are masked, 'sum' and 'prod' will produce the > additive and multiplicative identities respectively > > A masked zero dimensional array might be another option, depending on how > you handle scalars. This would also work when arrays were summed down an > axis if a masked array was returned. > I think there has to be a difference like with "sum" and "nansum". Maybe control over this would be a parameter to the sum function, indicating how to interpret masked values. > I suppose the problem with using the word 'mask' is the implication that it > hides something. Maybe 'window' would be an alternate choice, although in > this context I tend to think of 'mask' as having the meaning you assign to > it. > Some copy/paste from the NEP: There is some consternation about the conventional True/False interpretation of the mask, centered around the name "mask". One possibility to deal with this is to call it a "validity mask" in all documentation, which more clearly indicates that True means valid data. If this isn't sufficient, an alternate name for the attribute could be found, like "a.validitymask", "a.validmask", or "a.validity". -Mark > Chuck > > >> Thanks, >> Mark >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
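There is an existing precedent in NumPy for the two interpretations Mark mentions above, with NaN playing the role of the missing-value marker; a tiny sketch:

import numpy as np

a = np.array([1.0, np.nan, 3.0])

a.sum()         # nan -- the unknown value propagates
np.nansum(a)    # 4.0 -- the missing value is skipped

A keyword argument on the reductions could choose between these two behaviours in the same spirit.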
URL: From charlesr.harris at gmail.com Thu Jun 23 18:02:03 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 16:02:03 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <20110623214850.GD8017@phare.normalesup.org> References: <20110623214850.GD8017@phare.normalesup.org> Message-ID: On Thu, Jun 23, 2011 at 3:48 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Thu, Jun 23, 2011 at 03:53:31PM -0500, Mark Wiebe wrote: > > concluded that adding masks to the core ndarray appears is the best > way to > > deal with the problem in general. > > It seems to me that this is going to make the numpy array a way more > complex object. Althought it is currently quite simple, that object has > already a hard time getting acceptance beyond the scientific community, > whereas it should really be used in many other places. > > Right now, the numpy array can be seen as an extension of the C array, > basically a pointer, a data type, and a shape (and strides). This enables > easy sharing with libraries that have not been written with numpy in > mind. > > The limitations of the subclassing approach that you mention do not seem > fundemental to me. For instance the impossibility to mix subclasses could > perhaps be solved using the Mixin Pattern. Urghh. Um, excuse me. > Ufuncs need work, but I have > the impression that your proposal is simply to solve the special case of > masked data in the ufunc by breaking the simple numpy array model. > > I wonder how much of the complication could be located in the dtype. > By moving in the core a growing amount of functionality, it seems to me > that you are going to make it more and more complex while loosing its > genericity. Each new feature will need to go in the core and induce a > high cost. Making inheritance and unfuncs more generic seems to me like a > better investment. > > Inheritance is an overused idea, mostly because it is misused to provide functionality. Mixins are another way to do the same but they have their own problems. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 23 18:04:27 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 23 Jun 2011 17:04:27 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <20110623214850.GD8017@phare.normalesup.org> Message-ID: On Thu, Jun 23, 2011 at 17:02, Charles R Harris wrote: > > On Thu, Jun 23, 2011 at 3:48 PM, Gael Varoquaux > wrote: >> Ufuncs need work, but I have >> the impression that your proposal is simply to solve the special case of >> masked data in the ufunc by breaking the simple numpy array model. > > I wonder how much of the complication could be located in the dtype. What dtype? There are no new dtypes in this proposal. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From kwgoodman at gmail.com Thu Jun 23 18:05:21 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Thu, 23 Jun 2011 15:05:21 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > Enthought has asked me to look into the "missing data" problem and how NumPy > could treat it better. 
I've considered the different ideas of adding dtype > variants with a special signal value and masked arrays, and concluded that > adding masks to the core ndarray appears is the best way to deal with the > problem in general. > I've written a NEP that proposes a particular design, viewable here: > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > There are some questions at the bottom of the NEP which definitely need > discussion to find the best design choices. Please read, and let me know of > all the errors and gaps you find in the document. Wow, that is exciting. I wonder about the relative performance of the two possible implementations (mask and NA) in the PEP. If you are, say, doing a calculation along the columns of a 2d array one element at a time, then you will need to grab an element from the array and grab the corresponding element from the mask. I assume the corresponding data and mask elements are not stored together. That would be slow since memory access is usually were time is spent. In this regard NA would be faster. I currently use NaN as a missing data marker. That adds things like this to my cython code: if a[i] == a[i]: asum += a[i] If NA also had the property NA == NA is False, then it would be easy to use. A mask, on the other hand, would be more difficult for third party packages to support. You have to check if the mask is present and if so do a mask-aware calculation; if is it not present then you have to do a non-mask based calculation. So you have two code paths. You also need to check if any of the input arrays have masks and if so apply masks to the other inputs, etc. From charlesr.harris at gmail.com Thu Jun 23 18:05:33 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 16:05:33 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <20110623214850.GD8017@phare.normalesup.org> Message-ID: On Thu, Jun 23, 2011 at 4:04 PM, Robert Kern wrote: > On Thu, Jun 23, 2011 at 17:02, Charles R Harris > wrote: > > > > On Thu, Jun 23, 2011 at 3:48 PM, Gael Varoquaux > > wrote: > > >> Ufuncs need work, but I have > >> the impression that your proposal is simply to solve the special case of > >> masked data in the ufunc by breaking the simple numpy array model. > > > > I wonder how much of the complication could be located in the dtype. > > What dtype? There are no new dtypes in this proposal. > > Not yet, there aren't. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 18:06:13 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 17:06:13 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <20110623214850.GD8017@phare.normalesup.org> References: <20110623214850.GD8017@phare.normalesup.org> Message-ID: On Thu, Jun 23, 2011 at 4:48 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Thu, Jun 23, 2011 at 03:53:31PM -0500, Mark Wiebe wrote: > > concluded that adding masks to the core ndarray appears is the best > way to > > deal with the problem in general. > > It seems to me that this is going to make the numpy array a way more > complex object. Althought it is currently quite simple, that object has > already a hard time getting acceptance beyond the scientific community, > whereas it should really be used in many other places. 
> I don't think it's the simplicity or complexity of the object which is preventing more acceptance, but rather the implementation issues it currently has. I've done a lot to smooth out the rough edges already, for example the cleanup to structured arrays I just merged fixes a number of things which just worked poorly or couldn't be done before. Right now, the numpy array can be seen as an extension of the C array, > basically a pointer, a data type, and a shape (and strides). This enables > easy sharing with libraries that have not been written with numpy in > mind. > It's not that simple, unfortunately, the C API does not provide a good interface for this. I think a good C++ API needs to be created, because C doesn't have enough expressive power to make interoperability simple. The limitations of the subclassing approach that you mention do not seem > fundemental to me. For instance the impossibility to mix subclasses could > perhaps be solved using the Mixin Pattern. Ufuncs need work, but I have > the impression that your proposal is simply to solve the special case of > masked data in the ufunc by breaking the simple numpy array model. > This doesn't break the numpy array model, that stays exactly as it is. It does provide a consistent way to deal with masks that align with those arrays. The subclassing mechanism doesn't provide a way to give a robust abstraction of the missing value phenomenon, and that is what my proposal is providing. A simple API call for default behaviors when masks aren't supported seems reasonable to me. By moving in the core a growing amount of functionality, it seems to me > that you are going to make it more and more complex while loosing its > genericity. Each new feature will need to go in the core and induce a > high cost. Making inheritance and unfuncs more generic seems to me like a > better investment. > Putting masks into the core increases the genericity, because it makes that feature orthogonal to all the other features in a very clean way. The problems you are raising are potential issues if the implementation is done poorly, not fundamental issues with the idea. > > My 2 cents, > > Gael > > PS: I am on the verge of conference travel, so I will not be able to > participate any further in the discussion. > Thanks for taking the time to respond, -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 23 18:08:06 2011 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 23 Jun 2011 17:08:06 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <20110623214850.GD8017@phare.normalesup.org> Message-ID: On Thu, Jun 23, 2011 at 17:05, Charles R Harris wrote: > > On Thu, Jun 23, 2011 at 4:04 PM, Robert Kern wrote: >> >> On Thu, Jun 23, 2011 at 17:02, Charles R Harris >> wrote: >> > >> > On Thu, Jun 23, 2011 at 3:48 PM, Gael Varoquaux >> > wrote: >> >> >> Ufuncs need work, but I have >> >> the impression that your proposal is simply to solve the special case >> >> of >> >> masked data in the ufunc by breaking the simple numpy array model. >> > >> > I wonder how much of the complication could be located in the dtype. >> >> What dtype? There are no new dtypes in this proposal. > > Not yet, there aren't. You're being cryptic. 
The entire point of the proposal seems to be to *avoid* new dtypes for the purpose of handling missing data. What are you trying to refer to? -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From mwwiebe at gmail.com Thu Jun 23 18:14:16 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 17:14:16 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E03B625.4000302@hawaii.edu> References: <4E03B625.4000302@hawaii.edu> Message-ID: On Thu, Jun 23, 2011 at 4:54 PM, Eric Firing wrote: > On 06/23/2011 11:19 AM, Nathaniel Smith wrote: > > I'd like to see a statement of what the "missing data problem" is, and > > how this solves it? Because I don't think this is entirely intuitive, > > or that everyone necessarily has the same idea. > > > >> Reduction operations like 'sum', 'prod', 'min', and 'max' will operate > as if the values weren't there > > > > For context: My experience with missing data is in statistical > > analysis; I find R's NA support to be pretty awesome for those > > purposes. The conceptual model it's based on is that an NA value is > > some number that we just happen not to know. So from this perspective, > > I find it pretty confusing that adding an unknown quantity to 3 should > > result in 3, rather than another unknown quantity. (Obviously it > > should be possible to compute the sum of the known values, but IME > > it's important for the default behavior to be to fail loudly when > > things are wonky, not to silently patch them up, possibly > > incorrectly!) > > From the oceanographic data acquisition and analysis perspective, and > perhaps from a more general plotting perspective (matplotlib, > specifically) missing data is simply missing; we don't have it, we never > will, but we need to do the best calculation (or plot) we can with what > is left. For plotting, that generally means showing a gap in a line, a > hole in a contour plot, etc. For calculations like basic statistics, it > means doing the calculation, e.g. a mean, with the available numbers, > *and* having an easy way to find out how many numbers were available. > That's what the masked array count() method is for. > I'm thinking a parameter for sum, mean, etc which enables this interpretation is a good approach for these calculations. Some types of calculations, like the FFT, simply can't be done by > ignoring missing values, so one must first use some filling method, > perhaps interpolation, for example, and then pass an unmasked array to > the function. > These kinds of functions will have to raise exceptions when called on an array which has an masked value, true. > > The present masked array module is very close to what is really needed > for the sorts of things I am involved with. It looks to me like the > main deficiencies are addressed by Mark's proposal, although the change > in the definition of the mask might make for a painful transition. > Yeah, I understand the pain, but I'd much prefer to align with the general consensus about masks elsewhere than stick with the current convention. -Mark > > Eric > > > > > Also, what should 'dot' do with missing values? > > > > -- Nathaniel > > > > On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > >> Enthought has asked me to look into the "missing data" problem and how > NumPy > >> could treat it better. 
I've considered the different ideas of adding > dtype > >> variants with a special signal value and masked arrays, and concluded > that > >> adding masks to the core ndarray appears is the best way to deal with > the > >> problem in general. > >> I've written a NEP that proposes a particular design, viewable here: > >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > >> There are some questions at the bottom of the NEP which definitely need > >> discussion to find the best design choices. Please read, and let me know > of > >> all the errors and gaps you find in the document. > >> Thanks, > >> Mark > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > >> > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 23 18:14:32 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 16:14:32 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <20110623214850.GD8017@phare.normalesup.org> Message-ID: On Thu, Jun 23, 2011 at 4:08 PM, Robert Kern wrote: > On Thu, Jun 23, 2011 at 17:05, Charles R Harris > wrote: > > > > On Thu, Jun 23, 2011 at 4:04 PM, Robert Kern > wrote: > >> > >> On Thu, Jun 23, 2011 at 17:02, Charles R Harris > >> wrote: > >> > > >> > On Thu, Jun 23, 2011 at 3:48 PM, Gael Varoquaux > >> > wrote: > >> > >> >> Ufuncs need work, but I have > >> >> the impression that your proposal is simply to solve the special case > >> >> of > >> >> masked data in the ufunc by breaking the simple numpy array model. > >> > > >> > I wonder how much of the complication could be located in the dtype. > >> > >> What dtype? There are no new dtypes in this proposal. > > > > Not yet, there aren't. > > You're being cryptic. The entire point of the proposal seems to be to > *avoid* new dtypes for the purpose of handling missing data. What are > you trying to refer to? > > I'm being sideways. I was just toying with the idea of masked dtypes. That way some parts of mask handling go over into finding mutuallly compatible types and arrays remain as collections of same sized bits of data. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Thu Jun 23 18:24:10 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 00:24:10 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Jun 23, 2011, at 11:55 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris wrote: > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe wrote: > Enthought has asked me to look into the "missing data" problem and how NumPy could treat it better. I've considered the different ideas of adding dtype variants with a special signal value and masked arrays, and concluded that adding masks to the core ndarray appears is the best way to deal with the problem in general. 
> > I've written a NEP that proposes a particular design, viewable here: > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst Mmh, after timeseries, now masked arrays... Mark, I start to see a pattern here ;) > There are some questions at the bottom of the NEP which definitely need discussion to find the best design choices. Please read, and let me know of all the errors and gaps you find in the document. > > > I agree that low level support for masks is the way to go. The objective was to have numpy.ma in C, yes. Been clear since Numeric, but nobody had time to do it. And I still don't speak C, so Python it was. Anyhow, yes, there should be some work to address some of numpy.ma shortcomings. I may a bit conservative, but I don't really see the reason to follow a radically different approach. Your idea of switching the current convention of mask (a True meaning that the data can be accessed) will lead to a lot of fun indeed. And sorry, what "general consensus about masks elsewhere" are you referring to ? > There is some consternation about the conventional True/False > interpretation of the mask, centered around the name "mask". Don't call it "mask" at all then. "accessible" ? "access" ? Avoid "valid", it's too connotated. From mwwiebe at gmail.com Thu Jun 23 18:24:40 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 17:24:40 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman wrote: > On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > > Enthought has asked me to look into the "missing data" problem and how > NumPy > > could treat it better. I've considered the different ideas of adding > dtype > > variants with a special signal value and masked arrays, and concluded > that > > adding masks to the core ndarray appears is the best way to deal with the > > problem in general. > > I've written a NEP that proposes a particular design, viewable here: > > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > > There are some questions at the bottom of the NEP which definitely need > > discussion to find the best design choices. Please read, and let me know > of > > all the errors and gaps you find in the document. > > Wow, that is exciting. > > I wonder about the relative performance of the two possible > implementations (mask and NA) in the PEP. > I've given that some thought, and I don't think there's a clear way to tell what the performance gap would be without implementations of both to benchmark against each other. I favor the mask primarily because it provides masking for all data types in one go with a single consistent interface to program against. For adding NA signal values, each new data type would need a lot of work to gain the same level of support. If you are, say, doing a calculation along the columns of a 2d array > one element at a time, then you will need to grab an element from the > array and grab the corresponding element from the mask. I assume the > corresponding data and mask elements are not stored together. That > would be slow since memory access is usually were time is spent. In > this regard NA would be faster. > Yes, the masks add more memory traffic and some extra calculation, while the NA signal values just require some additional calculations. I currently use NaN as a missing data marker. 
That adds things like > this to my cython code: > > if a[i] == a[i]: > asum += a[i] > > If NA also had the property NA == NA is False, then it would be easy > to use. That's what I believe it should do, and I guess this is a strike against the idea of returning None for a single missing value. > A mask, on the other hand, would be more difficult for third > party packages to support. You have to check if the mask is present > and if so do a mask-aware calculation; if is it not present then you > have to do a non-mask based calculation. I actually see the mask as being easier for third party packages to support, particularly from C. Having regular C-friendly values with a boolean mask is a lot friendlier than values that require a lot of special casing like the NA signal values would require. > So you have two code paths. > You also need to check if any of the input arrays have masks and if so > apply masks to the other inputs, etc. > Most of the time, the masks will transparently propagate or not along with the arrays, with no effort required. In Python, the code you write would be virtually the same between the two approaches. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 18:43:13 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 17:43:13 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:24 PM, Pierre GM wrote: > > On Jun 23, 2011, at 11:55 PM, Mark Wiebe wrote: > > > On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe wrote: > > Enthought has asked me to look into the "missing data" problem and how > NumPy could treat it better. I've considered the different ideas of adding > dtype variants with a special signal value and masked arrays, and concluded > that adding masks to the core ndarray appears is the best way to deal with > the problem in general. > > > > I've written a NEP that proposes a particular design, viewable here: > > > > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > > Mmh, after timeseries, now masked arrays... Mark, I start to see a pattern > here ;) I think it speaks to what's on Enthought's mind, in any case. :) > There are some questions at the bottom of the NEP which definitely need > discussion to find the best design choices. Please read, and let me know of > all the errors and gaps you find in the document. > > > > > > I agree that low level support for masks is the way to go. > > The objective was to have numpy.ma in C, yes. Been clear since Numeric, > but nobody had time to do it. And I still don't speak C, so Python it was. > Anyhow, yes, there should be some work to address some of numpy.mashortcomings. I may a bit conservative, but I don't really see the reason to > follow a radically different approach. Your idea of switching the current > convention of mask (a True meaning that the data can be accessed) will lead > to a lot of fun indeed. And sorry, what "general consensus about masks > elsewhere" are you referring to ? > I've used masks for many things in many contexts, and was quite surprised the first time I saw the convention being used in numpy.ma. 
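As a concrete reference for the two things being discussed here, the NaN-as-marker idiom and the numpy.ma convention where True means "masked", a minimal sketch using only existing NumPy (nothing below is new NEP behaviour):

    import numpy as np
    import numpy.ma as ma

    x = np.array([1.0, np.nan, 2.0, np.nan, 4.0])

    # NaN-as-marker idiom: a value is "present" exactly when it equals itself.
    x[x == x].sum()        # -> 7.0, the NaN entries are skipped

    # numpy.ma convention: True in the mask means the value is hidden.
    m = ma.masked_invalid(x)
    m.mask                 # -> [False  True False  True False]
    m.sum()                # -> 7.0
    m.count()              # -> 3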
In image processing, it's well established that a mask is white (1) for valid pixels and black (0) for missing/transparent pixels. Applying a mask to an image is a multiplication, something which works out very nicely. It also corresponds nicely with array boolean indexing, where a[a != 0] uses the mask "a != 0" to select nonzero elements. For at least a couple of releases, arrays with masks and instances of numpy.ma will have to coexist, and I'm hoping the new implementation will be intuitive enough that the transition won't be too crazy. > > There is some consternation about the conventional True/False > > interpretation of the mask, centered around the name "mask". > > Don't call it "mask" at all then. "accessible" ? "access" ? Avoid "valid", > it's too connotated. > Coming up with a good name will take some thinking. Of the names so far, "validity" is my favorite. -Mark > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 23 19:02:40 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 17:02:40 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 4:43 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 5:24 PM, Pierre GM wrote: > >> >> On Jun 23, 2011, at 11:55 PM, Mark Wiebe wrote: >> >> > On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe wrote: >> > Enthought has asked me to look into the "missing data" problem and how >> NumPy could treat it better. I've considered the different ideas of adding >> dtype variants with a special signal value and masked arrays, and concluded >> that adding masks to the core ndarray appears is the best way to deal with >> the problem in general. >> > >> > I've written a NEP that proposes a particular design, viewable here: >> > >> > >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> >> Mmh, after timeseries, now masked arrays... Mark, I start to see a pattern >> here ;) > > > I think it speaks to what's on Enthought's mind, in any case. :) > What is the thinking at Enthought about this? I sense a meeting in the background and it would be nice to know what the motivations were. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 19:14:16 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 18:14:16 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 6:02 PM, Charles R Harris wrote: > On Thu, Jun 23, 2011 at 4:43 PM, Mark Wiebe wrote: > >> On Thu, Jun 23, 2011 at 5:24 PM, Pierre GM wrote: >> >>> >>> On Jun 23, 2011, at 11:55 PM, Mark Wiebe wrote: >>> >>> > On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris < >>> charlesr.harris at gmail.com> wrote: >>> > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe wrote: >>> > Enthought has asked me to look into the "missing data" problem and how >>> NumPy could treat it better. 
I've considered the different ideas of adding >>> dtype variants with a special signal value and masked arrays, and concluded >>> that adding masks to the core ndarray appears is the best way to deal with >>> the problem in general. >>> > >>> > I've written a NEP that proposes a particular design, viewable here: >>> > >>> > >>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >>> >>> Mmh, after timeseries, now masked arrays... Mark, I start to see a >>> pattern here ;) >> >> >> I think it speaks to what's on Enthought's mind, in any case. :) >> > > What is the thinking at Enthought about this? I sense a meeting in the > background and it would be nice to know what the motivations were. > A lot of these things were discussed at the DataArray summit they held here a while ago, but the general push is towards being more approachable and friendly to people who use R or do statistical data analysis on possibly messy data. I don't use R, but I think I'm fairly good at producing things which are both powerful and intuitive in the end, which is what I'm trying to do here. That's my understanding of the thinking, others at Enthought will have to correct me if I'm wrong... -Mark > > > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 23 19:33:44 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 17:33:44 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:14 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 6:02 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> On Thu, Jun 23, 2011 at 4:43 PM, Mark Wiebe wrote: >> >>> On Thu, Jun 23, 2011 at 5:24 PM, Pierre GM wrote: >>> >>>> >>>> On Jun 23, 2011, at 11:55 PM, Mark Wiebe wrote: >>>> >>>> > On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris < >>>> charlesr.harris at gmail.com> wrote: >>>> > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe >>>> wrote: >>>> > Enthought has asked me to look into the "missing data" problem and how >>>> NumPy could treat it better. I've considered the different ideas of adding >>>> dtype variants with a special signal value and masked arrays, and concluded >>>> that adding masks to the core ndarray appears is the best way to deal with >>>> the problem in general. >>>> > >>>> > I've written a NEP that proposes a particular design, viewable here: >>>> > >>>> > >>>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >>>> >>>> Mmh, after timeseries, now masked arrays... Mark, I start to see a >>>> pattern here ;) >>> >>> >>> I think it speaks to what's on Enthought's mind, in any case. :) >>> >> >> What is the thinking at Enthought about this? I sense a meeting in the >> background and it would be nice to know what the motivations were. >> > > A lot of these things were discussed at the DataArray summit they held here > a while ago, but the general push is towards being more approachable and > friendly to people who use R or do statistical data analysis on possibly > messy data. I don't use R, but I think I'm fairly good at producing things > which are both powerful and intuitive in the end, which is what I'm trying > to do here. 
That's my understanding of the thinking, others at Enthought > will have to correct me if I'm wrong... > > One thing that wasn't clear to me in the NEP was how one might experiment with different masks on a fixed set of data. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 19:36:15 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 18:36:15 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 6:33 PM, Charles R Harris wrote: > > > On Thu, Jun 23, 2011 at 5:14 PM, Mark Wiebe wrote: > >> On Thu, Jun 23, 2011 at 6:02 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> On Thu, Jun 23, 2011 at 4:43 PM, Mark Wiebe wrote: >>> >>>> On Thu, Jun 23, 2011 at 5:24 PM, Pierre GM wrote: >>>> >>>>> >>>>> On Jun 23, 2011, at 11:55 PM, Mark Wiebe wrote: >>>>> >>>>> > On Thu, Jun 23, 2011 at 4:46 PM, Charles R Harris < >>>>> charlesr.harris at gmail.com> wrote: >>>>> > On Thu, Jun 23, 2011 at 2:53 PM, Mark Wiebe >>>>> wrote: >>>>> > Enthought has asked me to look into the "missing data" problem and >>>>> how NumPy could treat it better. I've considered the different ideas of >>>>> adding dtype variants with a special signal value and masked arrays, and >>>>> concluded that adding masks to the core ndarray appears is the best way to >>>>> deal with the problem in general. >>>>> > >>>>> > I've written a NEP that proposes a particular design, viewable here: >>>>> > >>>>> > >>>>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >>>>> >>>>> Mmh, after timeseries, now masked arrays... Mark, I start to see a >>>>> pattern here ;) >>>> >>>> >>>> I think it speaks to what's on Enthought's mind, in any case. :) >>>> >>> >>> What is the thinking at Enthought about this? I sense a meeting in the >>> background and it would be nice to know what the motivations were. >>> >> >> A lot of these things were discussed at the DataArray summit they held >> here a while ago, but the general push is towards being more approachable >> and friendly to people who use R or do statistical data analysis on possibly >> messy data. I don't use R, but I think I'm fairly good at producing things >> which are both powerful and intuitive in the end, which is what I'm trying >> to do here. That's my understanding of the thinking, others at Enthought >> will have to correct me if I'm wrong... >> >> > One thing that wasn't clear to me in the NEP was how one might experiment > with different masks on a fixed set of data. > You could create multiple views of the same array, then add a different mask to each one. -Mark > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Thu Jun 23 19:39:07 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 01:39:07 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <416C2B7C-3EBB-48E2-BBB5-1E3328ABEC54@gmail.com> On Jun 24, 2011, at 12:43 AM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 5:24 PM, Pierre GM wrote: > ... > Mmh, after timeseries, now masked arrays... 
Mark, I start to see a pattern here ;) > > I think it speaks to what's on Enthought's mind, in any case. :) Eh... > > Anyhow, yes, there should be some work to address some of numpy.ma shortcomings. I may a bit conservative, but I don't really see the reason to follow a radically different approach. Your idea of switching the current convention of mask (a True meaning that the data can be accessed) will lead to a lot of fun indeed. And sorry, what "general consensus about masks elsewhere" are you referring to ? > > I've used masks for many things in many contexts, and was quite surprised the first time I saw the convention being used in numpy.ma. In image processing, it's well established that a mask is white (1) for valid pixels and black (0) for missing/transparent pixels. Applying a mask to an image is a multiplication, something which works out very nicely. It also corresponds nicely with array boolean indexing, where a[a != 0] uses the mask "a != 0" to select nonzero elements. OK. At the time I worked on numpy.ma, I was still doing GIS, and a mask was just an extra layer that would hide some unwanted features. So True if the mask is on, False otherwise made sense. Besides, that was the existing convention in Numeric, so my hands were tied. > For at least a couple of releases, arrays with masks and instances of numpy.ma will have to coexist, and I'm hoping the new implementation will be intuitive enough that the transition won't be too crazy. Well, "intuitive" can be a fairly subjective notion... And frankly, I still fail to see the point. After all, it's a convention. > > There is some consternation about the conventional True/False > > interpretation of the mask, centered around the name "mask". > > Don't call it "mask" at all then. "accessible" ? "access" ? Avoid "valid", it's too connotated. > > Coming up with a good name will take some thinking. Of the names so far, "validity" is my favorite. A rose by any other name... Anyhow, more comments to follow. From Chris.Barker at noaa.gov Thu Jun 23 19:39:56 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Thu, 23 Jun 2011 16:39:56 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4E03CECC.2030705@noaa.gov> Nathaniel Smith wrote: > IME > it's important for the default behavior to be to fail loudly when > things are wonky, not to silently patch them up, possibly > incorrectly!) +1 for default behavior to fail, raise errors, or return "unknown values" is these situations. Though that behavior should be over-ridable with a flag or addition functions (like nan-sum) Robert Kern wrote: > How your proposal > improves on the status quo, i.e. numpy.ma. I've thought for years that masked arrays should be better integrated with numpy -- it seems there is a lot of code duplication and upkeep required to keep masked arrays working as much like regular arrays as possible. Particularly having third-party code work with them. And I'm not sure I can explain why, but I find myself hardly ever using masked arrays -- it just feels like added complication to use them -- I tend to work with floats and use NaN instead, even though it's often not as good an option. Gael Varoquaux wrote: > Right now, the numpy array can be seen as an extension of the C array, > basically a pointer, a data type, and a shape (and strides). This enables > easy sharing with libraries that have not been written with numpy in > mind. 
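For readers less familiar with that C-level picture, the "pointer, data type, shape and strides" view Gael describes is directly visible from Python. A small illustrative sketch, standard NumPy only:

    import numpy as np

    a = np.arange(12, dtype=np.float64).reshape(3, 4)
    ptr, readonly = a.__array_interface__['data']   # raw address of the data buffer
    # a.dtype   -> float64
    # a.shape   -> (3, 4)
    # a.strides -> (32, 8) bytes: 4 * 8 to step one row, 8 to step one column
    # Together with ptr, this is the whole "C view" of the array.

It is this simple memory description that makes zero-copy sharing with C libraries straightforward.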
Isn't that what the PEP 3118 extended buffer protocol is supposed to be for? Anyway, good stuff! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From josef.pktd at gmail.com Thu Jun 23 19:51:25 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jun 2011 19:51:25 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith wrote: >> >> I'd like to see a statement of what the "missing data problem" is, and >> how this solves it? Because I don't think this is entirely intuitive, >> or that everyone necessarily has the same idea. > > I agree it represents different problems in different contexts. For NumPy, I > think the mechanism for dealing with it needs to be intuitive to work with > in a maximum number of contexts, avoiding surprises. Getting feedback from a > broad range of people is the only way a general solution can be designed > with any level of confidence. >> >> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate >> > as if the values weren't there >> >> For context: My experience with missing data is in statistical >> analysis; I find R's NA support to be pretty awesome for those >> purposes. The conceptual model it's based on is that an NA value is >> some number that we just happen not to know. So from this perspective, >> I find it pretty confusing that adding an unknown quantity to 3 should >> result in 3, rather than another unknown quantity. (Obviously it >> should be possible to compute the sum of the known values, but IME >> it's important for the default behavior to be to fail loudly when >> things are wonky, not to silently patch them up, possibly >> incorrectly!) > > The conceptual model you describe sounds reasonable to me, and I definitely > like the idea of consistently following one such model for all default > behaviors. > >> >> Also, what should 'dot' do with missing values? > > A matrix multiplication is defined in terms of sums of products, so it can > be implemented to behave consistently with your conceptual model. >From the perspective of statistical analysis, I don't see much advantage of this. What to do with nans depends on the analysis, and needs to be looked at for each case. Only easy descriptive statistics work without problems, nansum, .... All the other usages require rewriting the algorithm, see scipy.stats versus scipy.mstats. In R often the nan handling is remove all observations (rows) with at least one nan, or we go to some fancier imputation of missing values algorithms. What happens if I just want to go back and forth between using Lapack and minpack, none of them suddenly grow missing values handling, and if they would it might not be what we want. arrays with nans are nice for data handling, but I don't see why we should pay for any overhead for number crunching with numpy arrays. Josef > >> >> -- Nathaniel >> >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: >> > Enthought has asked me to look into the "missing data" problem and how >> > NumPy >> > could treat it better. 
I've considered the different ideas of adding >> > dtype >> > variants with a special signal value and masked arrays, and concluded >> > that >> > adding masks to the core ndarray appears is the best way to deal with >> > the >> > problem in general. >> > I've written a NEP that proposes a particular design, viewable here: >> > >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> > There are some questions at the bottom of the NEP which definitely need >> > discussion to find the best design choices. Please read, and let me know >> > of >> > all the errors and gaps you find in the document. >> > Thanks, >> > Mark >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> > >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From mwwiebe at gmail.com Thu Jun 23 19:52:17 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 18:52:17 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E03CECC.2030705@noaa.gov> References: <4E03CECC.2030705@noaa.gov> Message-ID: On Thu, Jun 23, 2011 at 6:39 PM, Christopher Barker wrote: > Nathaniel Smith wrote: > > IME > > it's important for the default behavior to be to fail loudly when > > things are wonky, not to silently patch them up, possibly > > incorrectly!) > > +1 for default behavior to fail, raise errors, or return "unknown > values" is these situations. > > Though that behavior should be over-ridable with a flag or addition > functions (like nan-sum) > Yeah, I'm changing the description to be this way. Robert Kern wrote: > > How your proposal > > improves on the status quo, i.e. numpy.ma. > > I've thought for years that masked arrays should be better integrated > with numpy -- it seems there is a lot of code duplication and upkeep > required to keep masked arrays working as much like regular arrays as > possible. Particularly having third-party code work with them. > > And I'm not sure I can explain why, but I find myself hardly ever using > masked arrays -- it just feels like added complication to use them -- I > tend to work with floats and use NaN instead, even though it's often not > as good an option. > > Gael Varoquaux wrote: > > Right now, the numpy array can be seen as an extension of the C array, > > basically a pointer, a data type, and a shape (and strides). This enables > > easy sharing with libraries that have not been written with numpy in > > mind. > > Isn't that what the PEP 3118 extended buffer protocol is supposed to be > for? > PEP 3118 is for allowing multiple Python plugins to communicate about regularly-structured array data without explicitly knowing about each other, yes. It doesn't specify anything about masks, I'm thinking disallowing 3118 views from arrays with masks is the most consistent way to deal with that. -Mark > > Anyway, good stuff! > > -Chris > > > > > > > -- > Christopher Barker, Ph.D. 
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 19:57:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 18:57:44 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 6:51 PM, wrote: > On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe wrote: > > On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith wrote: > >> > >> I'd like to see a statement of what the "missing data problem" is, and > >> how this solves it? Because I don't think this is entirely intuitive, > >> or that everyone necessarily has the same idea. > > > > I agree it represents different problems in different contexts. For > NumPy, I > > think the mechanism for dealing with it needs to be intuitive to work > with > > in a maximum number of contexts, avoiding surprises. Getting feedback > from a > > broad range of people is the only way a general solution can be designed > > with any level of confidence. > >> > >> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate > >> > as if the values weren't there > >> > >> For context: My experience with missing data is in statistical > >> analysis; I find R's NA support to be pretty awesome for those > >> purposes. The conceptual model it's based on is that an NA value is > >> some number that we just happen not to know. So from this perspective, > >> I find it pretty confusing that adding an unknown quantity to 3 should > >> result in 3, rather than another unknown quantity. (Obviously it > >> should be possible to compute the sum of the known values, but IME > >> it's important for the default behavior to be to fail loudly when > >> things are wonky, not to silently patch them up, possibly > >> incorrectly!) > > > > The conceptual model you describe sounds reasonable to me, and I > definitely > > like the idea of consistently following one such model for all default > > behaviors. > > > >> > >> Also, what should 'dot' do with missing values? > > > > A matrix multiplication is defined in terms of sums of products, so it > can > > be implemented to behave consistently with your conceptual model. > > >From the perspective of statistical analysis, I don't see much > advantage of this. > What to do with nans depends on the analysis, and needs to be looked > at for each case. > Sure, the alternatives to implementing 'dot' with missing values are to raise an exception or produce all missing values. > Only easy descriptive statistics work without problems, nansum, .... > > All the other usages require rewriting the algorithm, see scipy.stats > versus scipy.mstats. In R often the nan handling is remove all > observations (rows) with at least one nan, or we go to some fancier > imputation of missing values algorithms. > > What happens if I just want to go back and forth between using Lapack > and minpack, none of them suddenly grow missing values handling, and > if they would it might not be what we want. 
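As a concrete aside, the R-style handling quoted above, "remove all observations (rows) with at least one nan", is a one-liner in plain NumPy today (purely illustrative):

    import numpy as np

    data = np.array([[1.0, 2.0],
                     [np.nan, 3.0],
                     [4.0, 5.0]])
    complete = data[~np.isnan(data).any(axis=1)]   # keep only rows with no NaN
    # complete -> [[1., 2.], [4., 5.]]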
> > arrays with nans are nice for data handling, but I don't see why we > should pay for any overhead for number crunching with numpy arrays. > NaNs are certainly not going away, they're completely independent of masked values. -Mark > > Josef > > > > > >> > >> -- Nathaniel > >> > >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > >> > Enthought has asked me to look into the "missing data" problem and how > >> > NumPy > >> > could treat it better. I've considered the different ideas of adding > >> > dtype > >> > variants with a special signal value and masked arrays, and concluded > >> > that > >> > adding masks to the core ndarray appears is the best way to deal with > >> > the > >> > problem in general. > >> > I've written a NEP that proposes a particular design, viewable here: > >> > > >> > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > >> > There are some questions at the bottom of the NEP which definitely > need > >> > discussion to find the best design choices. Please read, and let me > know > >> > of > >> > all the errors and gaps you find in the document. > >> > Thanks, > >> > Mark > >> > _______________________________________________ > >> > NumPy-Discussion mailing list > >> > NumPy-Discussion at scipy.org > >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > > >> > > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Thu Jun 23 20:00:38 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 23 Jun 2011 17:00:38 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern wrote: > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: >> Enthought has asked me to look into the "missing data" problem and how NumPy >> could treat it better. I've considered the different ideas of adding dtype >> variants with a special signal value and masked arrays, and concluded that >> adding masks to the core ndarray appears is the best way to deal with the >> problem in general. >> I've written a NEP that proposes a particular design, viewable here: >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> There are some questions at the bottom of the NEP which definitely need >> discussion to find the best design choices. Please read, and let me know of >> all the errors and gaps you find in the document. > > One thing that could use more explanation is how your proposal > improves on the status quo, i.e. numpy.ma. As far as I can see, you > are mostly just shuffling around the functionality that already > exists. There has been a continual desire for something like R's NA > values by people who are very familiar with both R and numpy's masked > arrays. Both have their uses, and as Nathaniel points out, R's > approach seems to be very well-liked by a lot of users. 
In essence, > *that's* the "missing data problem" that you were charged with: making > happy the users who are currently dissatisfied with masked arrays. It > doesn't seem to me that moving the functionality from numpy.ma to > numpy.ndarray resolves any of their issues. Speaking as a user who's avoided numpy.ma, it wasn't actually because of the behavior I pointed out (I never got far enough to notice it), but because I got the distinct impression that it was a "second-class citizen" in numpy-land. I don't know if that's true. But I wasn't sure how solidly things like interactions between numpy and masked arrays worked, or how , and it seemed like it had more niche uses. So it just seemed like more hassle than it was worth for my purposes. Moving it into the core and making it really solid *would* address these issues... It does have to be solid, though. It occurs to me on further thought that one major advantage of having first-class "NA" values is that it preserves the standard looping idioms: for i in xrange(len(x)): x[i] = np.log(x[i]) According to the current proposal, this will blow up, but np.log(x) will work. That seems suboptimal to me. I do find the argument that we want a general solution compelling. I suppose we could have a magic "NA" value in Python-land which magically triggers fiddling with the mask when assigned to numpy arrays. It's should also be possible to accomplish a general solution at the dtype level. We could have a 'dtype factory' used like: np.zeros(10, dtype=np.maybe(float)) where np.maybe(x) returns a new dtype whose storage size is x.itemsize + 1, where the extra byte is used to store missingness information. (There might be some annoying alignment issues to deal with.) Then for each ufunc we define a handler for the maybe dtype (or add a special-case to the ufunc dispatch machinery) that checks the missingness value and then dispatches to the ordinary ufunc handler for the wrapped dtype. This would require fixing the issue where ufunc inner loops can't actually access the dtype object, but we should fix that anyway :-). -- Nathaniel From njs at pobox.com Thu Jun 23 20:03:48 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 23 Jun 2011 17:03:48 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:00 PM, Nathaniel Smith wrote: > Speaking as a user who's avoided numpy.ma, it wasn't actually because > of the behavior I pointed out (I never got far enough to notice it), > but because I got the distinct impression that it was a "second-class > citizen" in numpy-land. I don't know if that's true. But I wasn't sure > how solidly things like interactions between numpy and masked arrays > worked, or how and it seemed like it had more niche uses. So it just > seemed like more hassle than it was worth for my purposes. Moving it > into the core and making it really solid *would* address these > issues... Er, that should be, "or how to write generic code that handled both ordinary and masked arrays transparently,". Man, that's the second time today I've done that. I need to proofraed better... 
-- Nathaniel From charlesr.harris at gmail.com Thu Jun 23 20:17:28 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 18:17:28 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:51 PM, wrote: > On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe wrote: > > On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith wrote: > >> > >> I'd like to see a statement of what the "missing data problem" is, and > >> how this solves it? Because I don't think this is entirely intuitive, > >> or that everyone necessarily has the same idea. > > > > I agree it represents different problems in different contexts. For > NumPy, I > > think the mechanism for dealing with it needs to be intuitive to work > with > > in a maximum number of contexts, avoiding surprises. Getting feedback > from a > > broad range of people is the only way a general solution can be designed > > with any level of confidence. > >> > >> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate > >> > as if the values weren't there > >> > >> For context: My experience with missing data is in statistical > >> analysis; I find R's NA support to be pretty awesome for those > >> purposes. The conceptual model it's based on is that an NA value is > >> some number that we just happen not to know. So from this perspective, > >> I find it pretty confusing that adding an unknown quantity to 3 should > >> result in 3, rather than another unknown quantity. (Obviously it > >> should be possible to compute the sum of the known values, but IME > >> it's important for the default behavior to be to fail loudly when > >> things are wonky, not to silently patch them up, possibly > >> incorrectly!) > > > > The conceptual model you describe sounds reasonable to me, and I > definitely > > like the idea of consistently following one such model for all default > > behaviors. > > > >> > >> Also, what should 'dot' do with missing values? > > > > A matrix multiplication is defined in terms of sums of products, so it > can > > be implemented to behave consistently with your conceptual model. > > >From the perspective of statistical analysis, I don't see much > advantage of this. > What to do with nans depends on the analysis, and needs to be looked > at for each case. > > Only easy descriptive statistics work without problems, nansum, .... > > All the other usages require rewriting the algorithm, see scipy.stats > versus scipy.mstats. In R often the nan handling is remove all > observations (rows) with at least one nan, or we go to some fancier > imputation of missing values algorithms. > > What happens if I just want to go back and forth between using Lapack > and minpack, none of them suddenly grow missing values handling, and > if they would it might not be what we want. > > arrays with nans are nice for data handling, but I don't see why we > should pay for any overhead for number crunching with numpy arrays. > > I didn't get the impression that there would be noticeable overhead. On the other points, I think the idea should be to provide a low level mechanism that it flexible enough to allow implementation of various use cases at a higher level. For instance, current masked arrays could be reimplemented if desired, etc. Not that I think that should be done... Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Thu Jun 23 20:21:11 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 18:21:11 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 6:00 PM, Nathaniel Smith wrote: > On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern > wrote: > > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: > >> Enthought has asked me to look into the "missing data" problem and how > NumPy > >> could treat it better. I've considered the different ideas of adding > dtype > >> variants with a special signal value and masked arrays, and concluded > that > >> adding masks to the core ndarray appears is the best way to deal with > the > >> problem in general. > >> I've written a NEP that proposes a particular design, viewable here: > >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > >> There are some questions at the bottom of the NEP which definitely need > >> discussion to find the best design choices. Please read, and let me know > of > >> all the errors and gaps you find in the document. > > > > One thing that could use more explanation is how your proposal > > improves on the status quo, i.e. numpy.ma. As far as I can see, you > > are mostly just shuffling around the functionality that already > > exists. There has been a continual desire for something like R's NA > > values by people who are very familiar with both R and numpy's masked > > arrays. Both have their uses, and as Nathaniel points out, R's > > approach seems to be very well-liked by a lot of users. In essence, > > *that's* the "missing data problem" that you were charged with: making > > happy the users who are currently dissatisfied with masked arrays. It > > doesn't seem to me that moving the functionality from numpy.ma to > > numpy.ndarray resolves any of their issues. > > Speaking as a user who's avoided numpy.ma, it wasn't actually because > of the behavior I pointed out (I never got far enough to notice it), > but because I got the distinct impression that it was a "second-class > citizen" in numpy-land. I don't know if that's true. But I wasn't sure > how solidly things like interactions between numpy and masked arrays > worked, or how , and it seemed like it had more niche uses. So it just > seemed like more hassle than it was worth for my purposes. Moving it > into the core and making it really solid *would* address these > issues... > > There is some truth to that. The maintainer/creator of masked arrays was Pierre and he hasn't much time these days. Given that numpy will continue to advance and add features it is probably best for the masked array facility to be part of the basic toolset. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 20:21:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 19:21:44 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith wrote: > On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern > wrote: > > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: > >> Enthought has asked me to look into the "missing data" problem and how > NumPy > >> could treat it better. 
I've considered the different ideas of adding > dtype > >> variants with a special signal value and masked arrays, and concluded > that > >> adding masks to the core ndarray appears is the best way to deal with > the > >> problem in general. > >> I've written a NEP that proposes a particular design, viewable here: > >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > >> There are some questions at the bottom of the NEP which definitely need > >> discussion to find the best design choices. Please read, and let me know > of > >> all the errors and gaps you find in the document. > > > > One thing that could use more explanation is how your proposal > > improves on the status quo, i.e. numpy.ma. As far as I can see, you > > are mostly just shuffling around the functionality that already > > exists. There has been a continual desire for something like R's NA > > values by people who are very familiar with both R and numpy's masked > > arrays. Both have their uses, and as Nathaniel points out, R's > > approach seems to be very well-liked by a lot of users. In essence, > > *that's* the "missing data problem" that you were charged with: making > > happy the users who are currently dissatisfied with masked arrays. It > > doesn't seem to me that moving the functionality from numpy.ma to > > numpy.ndarray resolves any of their issues. > > Speaking as a user who's avoided numpy.ma, it wasn't actually because > of the behavior I pointed out (I never got far enough to notice it), > but because I got the distinct impression that it was a "second-class > citizen" in numpy-land. I don't know if that's true. But I wasn't sure > how solidly things like interactions between numpy and masked arrays > worked, or how , and it seemed like it had more niche uses. So it just > seemed like more hassle than it was worth for my purposes. Moving it > into the core and making it really solid *would* address these > issues... > These are definitely things I'm trying to address. It does have to be solid, though. It occurs to me on further thought > that one major advantage of having first-class "NA" values is that it > preserves the standard looping idioms: > > for i in xrange(len(x)): > x[i] = np.log(x[i]) > > According to the current proposal, this will blow up, but np.log(x) > will work. That seems suboptimal to me. > This boils down to the choice between None and a zero-dimensional array as the return value of 'x[i]'. This, and the desire that 'x[i] == x[i]' should be False if it's a masked value have convinced me that a zero-dimensional array is the way to go, and your example will work with this choice. > > I do find the argument that we want a general solution compelling. I > suppose we could have a magic "NA" value in Python-land which > magically triggers fiddling with the mask when assigned to numpy > arrays. > > It's should also be possible to accomplish a general solution at the > dtype level. We could have a 'dtype factory' used like: > np.zeros(10, dtype=np.maybe(float)) > where np.maybe(x) returns a new dtype whose storage size is x.itemsize > + 1, where the extra byte is used to store missingness information. > (There might be some annoying alignment issues to deal with.) Then for > each ufunc we define a handler for the maybe dtype (or add a > special-case to the ufunc dispatch machinery) that checks the > missingness value and then dispatches to the ordinary ufunc handler > for the wrapped dtype. 
> The 'dtype factory' idea builds on the way I've structured datetime as a parameterized type, but the thing that kills it for me is the alignment problems of 'x.itemsize + 1'. Having the mask in a separate memory block is a lot better than having to store 16 bytes for an 8-byte int to preserve the alignment. This would require fixing the issue where ufunc inner loops can't > actually access the dtype object, but we should fix that anyway :-). > Certainly true! -Mark > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Thu Jun 23 20:28:45 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 02:28:45 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Sorry y'all, I'm just commenting bits by bits: "One key problem is a lack of orthogonality with other features, for instance creating a masked array with physical quantities can't be done because both are separate subclasses of ndarray. The only reasonable way to deal with this is to move the mask into the core ndarray." Meh. I did try to make it easy to use masked arrays on top of subclasses. There's even some tests in the suite to that effect (test_subclassing). I'm not buying the argument. About moving mask in the core ndarray: I had suggested back in the days to have a mask flag/property built-in ndarrays (which would *really* have simplified the game), but this suggestion was dismissed very quickly as adding too much overload. I had to agree. I'm just a tad surprised the wind has changed on that matter. "In the current masked array, calculations are done for the whole array, then masks are patched up afterwords. This means that invalid calculations sitting in masked elements can raise warnings or exceptions even though they shouldn't, so the ufunc error handling mechanism can't be relied on." Well, there's a reason for that. Initially, I tried to guess what the mask of the output should be from the mask of the inputs, the objective being to avoid getting NaNs in the C array. That was easy in most cases, but it turned out it wasn't always possible (the `power` one caused me a lot of issues, if I recall correctly). So, for performance issues (to avoid a lot of expensive tests), I fell back on the old concept of "compute them all, they'll be sorted afterwards". Of course, that's rather clumsy an approach. But it works not too badly when in pure Python. No doubt that a proper C implementation would work faster. Oh, about using NaNs for invalid data ? Well, can't work with integers. `mask` property: Nothing to add to it. It's basically what we have now (except for the opposite convention). Working with masked values: I recall some strong points back in the days for not using None to represent missing values... Adding a maskedstr argument to array2string ? Mmh... I prefer a global flag like we have now. Design questions: Adding `masked` or whatever we call it to a number/array should result is masked/a fully masked array, period. That way, we can have an idea that something was wrong with the initial dataset. hardmask: I never used the feature myself. I wonder if anyone did. Still, it's a nice idea... 
From pgmdevlist at gmail.com Thu Jun 23 20:29:47 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 02:29:47 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4865F8B7-F6BF-46F9-BB87-CC0D3FA25C6E@gmail.com> On Jun 24, 2011, at 2:21 AM, Charles R Harris wrote: > > > On Thu, Jun 23, 2011 at 6:00 PM, Nathaniel Smith wrote: > On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern wrote: > > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: > >> Enthought has asked me to look into the "missing data" problem and how NumPy > >> could treat it better. I've considered the different ideas of adding dtype > >> variants with a special signal value and masked arrays, and concluded that > >> adding masks to the core ndarray appears is the best way to deal with the > >> problem in general. > >> I've written a NEP that proposes a particular design, viewable here: > >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > >> There are some questions at the bottom of the NEP which definitely need > >> discussion to find the best design choices. Please read, and let me know of > >> all the errors and gaps you find in the document. > > > > One thing that could use more explanation is how your proposal > > improves on the status quo, i.e. numpy.ma. As far as I can see, you > > are mostly just shuffling around the functionality that already > > exists. There has been a continual desire for something like R's NA > > values by people who are very familiar with both R and numpy's masked > > arrays. Both have their uses, and as Nathaniel points out, R's > > approach seems to be very well-liked by a lot of users. In essence, > > *that's* the "missing data problem" that you were charged with: making > > happy the users who are currently dissatisfied with masked arrays. It > > doesn't seem to me that moving the functionality from numpy.ma to > > numpy.ndarray resolves any of their issues. > > Speaking as a user who's avoided numpy.ma, it wasn't actually because > of the behavior I pointed out (I never got far enough to notice it), > but because I got the distinct impression that it was a "second-class > citizen" in numpy-land. I don't know if that's true. But I wasn't sure > how solidly things like interactions between numpy and masked arrays > worked, or how , and it seemed like it had more niche uses. So it just > seemed like more hassle than it was worth for my purposes. Moving it > into the core and making it really solid *would* address these > issues... > > > There is some truth to that. The maintainer/creator of masked arrays was Pierre and he hasn't much time these days. So true. I'd be delighted to work at least part time on it, though. Send me a job offer :) From charlesr.harris at gmail.com Thu Jun 23 20:31:03 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 23 Jun 2011 18:31:03 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 6:21 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith wrote: > >> On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern >> wrote: >> > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: >> >> Enthought has asked me to look into the "missing data" problem and how >> NumPy >> >> could treat it better. 
I've considered the different ideas of adding >> dtype >> >> variants with a special signal value and masked arrays, and concluded >> that >> >> adding masks to the core ndarray appears is the best way to deal with >> the >> >> problem in general. >> >> I've written a NEP that proposes a particular design, viewable here: >> >> >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> >> There are some questions at the bottom of the NEP which definitely need >> >> discussion to find the best design choices. Please read, and let me >> know of >> >> all the errors and gaps you find in the document. >> > >> > One thing that could use more explanation is how your proposal >> > improves on the status quo, i.e. numpy.ma. As far as I can see, you >> > are mostly just shuffling around the functionality that already >> > exists. There has been a continual desire for something like R's NA >> > values by people who are very familiar with both R and numpy's masked >> > arrays. Both have their uses, and as Nathaniel points out, R's >> > approach seems to be very well-liked by a lot of users. In essence, >> > *that's* the "missing data problem" that you were charged with: making >> > happy the users who are currently dissatisfied with masked arrays. It >> > doesn't seem to me that moving the functionality from numpy.ma to >> > numpy.ndarray resolves any of their issues. >> >> Speaking as a user who's avoided numpy.ma, it wasn't actually because >> of the behavior I pointed out (I never got far enough to notice it), >> but because I got the distinct impression that it was a "second-class >> citizen" in numpy-land. I don't know if that's true. But I wasn't sure >> how solidly things like interactions between numpy and masked arrays >> worked, or how , and it seemed like it had more niche uses. So it just >> seemed like more hassle than it was worth for my purposes. Moving it >> into the core and making it really solid *would* address these >> issues... >> > > These are definitely things I'm trying to address. > > It does have to be solid, though. It occurs to me on further thought >> that one major advantage of having first-class "NA" values is that it >> preserves the standard looping idioms: >> >> for i in xrange(len(x)): >> x[i] = np.log(x[i]) >> >> According to the current proposal, this will blow up, but np.log(x) >> will work. That seems suboptimal to me. >> > > This boils down to the choice between None and a zero-dimensional array as > the return value of 'x[i]'. This, and the desire that 'x[i] == x[i]' should > be False if it's a masked value have convinced me that a zero-dimensional > array is the way to go, and your example will work with this choice. > > >> >> I do find the argument that we want a general solution compelling. I >> suppose we could have a magic "NA" value in Python-land which >> magically triggers fiddling with the mask when assigned to numpy >> arrays. >> >> It's should also be possible to accomplish a general solution at the >> dtype level. We could have a 'dtype factory' used like: >> np.zeros(10, dtype=np.maybe(float)) >> where np.maybe(x) returns a new dtype whose storage size is x.itemsize >> + 1, where the extra byte is used to store missingness information. >> (There might be some annoying alignment issues to deal with.) 
Then for >> each ufunc we define a handler for the maybe dtype (or add a >> special-case to the ufunc dispatch machinery) that checks the >> missingness value and then dispatches to the ordinary ufunc handler >> for the wrapped dtype. >> > > The 'dtype factory' idea builds on the way I've structured datetime as a > parameterized type, but the thing that kills it for me is the alignment > problems of 'x.itemsize + 1'. Having the mask in a separate memory block is > a lot better than having to store 16 bytes for an 8-byte int to preserve the > alignment. > Yes, but that assumes it is appended to the existing types in the dtype individually instead of the dtype as a whole. The dtype with mask could just indicate a shadow array, an alpha channel if you will, that is essentially what you are already doing but just probide a different place to track it. > > This would require fixing the issue where ufunc inner loops can't >> actually access the dtype object, but we should fix that anyway :-). >> > > Certainly true! > > Chuck > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 23 20:34:11 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 19:34:11 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 7:31 PM, Charles R Harris wrote: > On Thu, Jun 23, 2011 at 6:21 PM, Mark Wiebe wrote: > >> On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith wrote: >> >>> On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern >>> wrote: >>> > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: >>> >> Enthought has asked me to look into the "missing data" problem and how >>> NumPy >>> >> could treat it better. I've considered the different ideas of adding >>> dtype >>> >> variants with a special signal value and masked arrays, and concluded >>> that >>> >> adding masks to the core ndarray appears is the best way to deal with >>> the >>> >> problem in general. >>> >> I've written a NEP that proposes a particular design, viewable here: >>> >> >>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >>> >> There are some questions at the bottom of the NEP which definitely >>> need >>> >> discussion to find the best design choices. Please read, and let me >>> know of >>> >> all the errors and gaps you find in the document. >>> > >>> > One thing that could use more explanation is how your proposal >>> > improves on the status quo, i.e. numpy.ma. As far as I can see, you >>> > are mostly just shuffling around the functionality that already >>> > exists. There has been a continual desire for something like R's NA >>> > values by people who are very familiar with both R and numpy's masked >>> > arrays. Both have their uses, and as Nathaniel points out, R's >>> > approach seems to be very well-liked by a lot of users. In essence, >>> > *that's* the "missing data problem" that you were charged with: making >>> > happy the users who are currently dissatisfied with masked arrays. It >>> > doesn't seem to me that moving the functionality from numpy.ma to >>> > numpy.ndarray resolves any of their issues. >>> >>> Speaking as a user who's avoided numpy.ma, it wasn't actually because >>> of the behavior I pointed out (I never got far enough to notice it), >>> but because I got the distinct impression that it was a "second-class >>> citizen" in numpy-land. I don't know if that's true. 
But I wasn't sure >>> how solidly things like interactions between numpy and masked arrays >>> worked, or how , and it seemed like it had more niche uses. So it just >>> seemed like more hassle than it was worth for my purposes. Moving it >>> into the core and making it really solid *would* address these >>> issues... >>> >> >> These are definitely things I'm trying to address. >> >> It does have to be solid, though. It occurs to me on further thought >>> that one major advantage of having first-class "NA" values is that it >>> preserves the standard looping idioms: >>> >>> for i in xrange(len(x)): >>> x[i] = np.log(x[i]) >>> >>> According to the current proposal, this will blow up, but np.log(x) >>> will work. That seems suboptimal to me. >>> >> >> This boils down to the choice between None and a zero-dimensional array as >> the return value of 'x[i]'. This, and the desire that 'x[i] == x[i]' should >> be False if it's a masked value have convinced me that a zero-dimensional >> array is the way to go, and your example will work with this choice. >> >> >>> >>> I do find the argument that we want a general solution compelling. I >>> suppose we could have a magic "NA" value in Python-land which >>> magically triggers fiddling with the mask when assigned to numpy >>> arrays. >>> >>> It's should also be possible to accomplish a general solution at the >>> dtype level. We could have a 'dtype factory' used like: >>> np.zeros(10, dtype=np.maybe(float)) >>> where np.maybe(x) returns a new dtype whose storage size is x.itemsize >>> + 1, where the extra byte is used to store missingness information. >>> (There might be some annoying alignment issues to deal with.) Then for >>> each ufunc we define a handler for the maybe dtype (or add a >>> special-case to the ufunc dispatch machinery) that checks the >>> missingness value and then dispatches to the ordinary ufunc handler >>> for the wrapped dtype. >>> >> >> The 'dtype factory' idea builds on the way I've structured datetime as a >> parameterized type, but the thing that kills it for me is the alignment >> problems of 'x.itemsize + 1'. Having the mask in a separate memory block is >> a lot better than having to store 16 bytes for an 8-byte int to preserve the >> alignment. >> > > Yes, but that assumes it is appended to the existing types in the dtype > individually instead of the dtype as a whole. The dtype with mask could just > indicate a shadow array, an alpha channel if you will, that is essentially > what you are already doing but just probide a different place to track it. > This would seem to change the definition of a dtype - currently it represents a contiguous block of memory. It doesn't need to use all of that memory, but the dtype conceptually owns it. I kind of like it that way, where the whole strides idea with data being all over memory space belonging to ndarray, not dtype. -Mark > This would require fixing the issue where ufunc inner loops can't >>> actually access the dtype object, but we should fix that anyway :-). >>> >> >> Certainly true! >> >> > Chuck > >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Thu Jun 23 20:42:11 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 23 Jun 2011 19:42:11 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: > Sorry y'all, I'm just commenting bits by bits: > > "One key problem is a lack of orthogonality with other features, for > instance creating a masked array with physical quantities can't be done > because both are separate subclasses of ndarray. The only reasonable way to > deal with this is to move the mask into the core ndarray." > > Meh. I did try to make it easy to use masked arrays on top of subclasses. > There's even some tests in the suite to that effect (test_subclassing). I'm > not buying the argument. > About moving mask in the core ndarray: I had suggested back in the days to > have a mask flag/property built-in ndarrays (which would *really* have > simplified the game), but this suggestion was dismissed very quickly as > adding too much overload. I had to agree. I'm just a tad surprised the wind > has changed on that matter. Ok, I'll have to change that section then. :) I don't remember seeing mention of this ability in the documentation, but I may not have been reading closely enough for that part. > "In the current masked array, calculations are done for the whole array, > then masks are patched up afterwords. This means that invalid calculations > sitting in masked elements can raise warnings or exceptions even though they > shouldn't, so the ufunc error handling mechanism can't be relied on." > > Well, there's a reason for that. Initially, I tried to guess what the mask > of the output should be from the mask of the inputs, the objective being to > avoid getting NaNs in the C array. That was easy in most cases, but it > turned out it wasn't always possible (the `power` one caused me a lot of > issues, if I recall correctly). So, for performance issues (to avoid a lot > of expensive tests), I fell back on the old concept of "compute them all, > they'll be sorted afterwards". > Of course, that's rather clumsy an approach. But it works not too badly > when in pure Python. No doubt that a proper C implementation would work > faster. > Oh, about using NaNs for invalid data ? Well, can't work with integers. > In my proposal, NaNs stay as unmasked NaN values, instead of turning into masked values. This is necessary for uniform treatment of all dtypes, but a subclass could override this behavior with an extra mask modification after arithmetic operations. > `mask` property: > Nothing to add to it. It's basically what we have now (except for the > opposite convention). > > Working with masked values: > I recall some strong points back in the days for not using None to > represent missing values... > Adding a maskedstr argument to array2string ? Mmh... I prefer a global flag > like we have now. > I'm not really a fan of all the global state that NumPy keeps, I guess I'm trying to stamp that out bit by bit as well where I can... Design questions: > Adding `masked` or whatever we call it to a number/array should result is > masked/a fully masked array, period. That way, we can have an idea that > something was wrong with the initial dataset. > I'm not sure I understand what you mean, in the design adding a mask means setting "a.mask = True", "a.mask = False", or "a.mask = " in general. 
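For reference, this is how the mask attribute behaves in the existing numpy.ma, where True currently means "hidden" (the opposite of the convention proposed in the NEP):

    import numpy as np

    a = np.ma.array([1.0, 2.0, 3.0])
    a[1] = np.ma.masked          # hide one element
    print(a)                     # [1.0 -- 3.0]
    print(a.mask)                # [False  True False]: True means hidden today
    a.mask = np.ma.nomask        # unmask everything again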
> hardmask: I never used the feature myself. I wonder if anyone did. Still, > it's a nice idea... > Ok, I'll leave that out of the initial design unless someone comes up with some strong use cases. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Thu Jun 23 20:56:12 2011 From: ben.root at ou.edu (Benjamin Root) Date: Thu, 23 Jun 2011 19:56:12 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: > Sorry y'all, I'm just commenting bits by bits: > > "One key problem is a lack of orthogonality with other features, for > instance creating a masked array with physical quantities can't be done > because both are separate subclasses of ndarray. The only reasonable way to > deal with this is to move the mask into the core ndarray." > > Meh. I did try to make it easy to use masked arrays on top of subclasses. > There's even some tests in the suite to that effect (test_subclassing). I'm > not buying the argument. > About moving mask in the core ndarray: I had suggested back in the days to > have a mask flag/property built-in ndarrays (which would *really* have > simplified the game), but this suggestion was dismissed very quickly as > adding too much overload. I had to agree. I'm just a tad surprised the wind > has changed on that matter. > > > "In the current masked array, calculations are done for the whole array, > then masks are patched up afterwords. This means that invalid calculations > sitting in masked elements can raise warnings or exceptions even though they > shouldn't, so the ufunc error handling mechanism can't be relied on." > > Well, there's a reason for that. Initially, I tried to guess what the mask > of the output should be from the mask of the inputs, the objective being to > avoid getting NaNs in the C array. That was easy in most cases, but it > turned out it wasn't always possible (the `power` one caused me a lot of > issues, if I recall correctly). So, for performance issues (to avoid a lot > of expensive tests), I fell back on the old concept of "compute them all, > they'll be sorted afterwards". > Of course, that's rather clumsy an approach. But it works not too badly > when in pure Python. No doubt that a proper C implementation would work > faster. > Oh, about using NaNs for invalid data ? Well, can't work with integers. > > `mask` property: > Nothing to add to it. It's basically what we have now (except for the > opposite convention). > > Working with masked values: > I recall some strong points back in the days for not using None to > represent missing values... > Adding a maskedstr argument to array2string ? Mmh... I prefer a global flag > like we have now. > > Design questions: > Adding `masked` or whatever we call it to a number/array should result is > masked/a fully masked array, period. That way, we can have an idea that > something was wrong with the initial dataset. > hardmask: I never used the feature myself. I wonder if anyone did. Still, > it's a nice idea... > As a heavy masked_array user, I regret not being able to participate more in this discussion as I am madly cranking out matplotlib code. 
I would like to say that I have always seen masked arrays as being the "next step up" from using arrays with NaNs. The hardmask/softmask/sharedmasked concepts are powerful, and I don't think they have yet to be exploited to their fullest potential. Masks are (relatively) easy when dealing with element-by-element operations that produces an array of the same shape (or at least the same number of elements in the case of reshape and transpose). What gets difficult is for reductions such as sum or max, etc. Then you get into the weirder cases such as unwrap and gradients that I brought up recently. I am not sure how to address this, but I am not a fan of the idea of adding yet another parameter to the ufuncs to determine what to do for filling in a mask. Also, just to make things messier, there is an incomplete feature that was made for record arrays with regards to masking. The idea was to allow for element-by-element masking, but also allow for row-by-row (or was it column-by-column?) masking. I thought it was a neat feature, and it is too bad that it was not finished. Anyway, my opinion is that a mask should be True for a value that needs to be hidden. Do not change this convention. People coming into python already has to change code, a simple bit flip for them should be fine. Breaking existing python code is worse. I also don't see it as entirely necessary for *all* of masked arrays to be brought into numpy core. Only the most important parts/hooks need to be. We could then still have a masked array class that provides the finishing touches such as the sharing of masks and special masked related functions. Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs woulldn't work anyway with integer arrays (and object arrays, etc.). Back to toiling on matplotlib, Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Thu Jun 23 21:00:07 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 03:00:07 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: <72008329-9B02-452B-B74F-91883FEF95E5@gmail.com> On Jun 24, 2011, at 2:42 AM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: > Sorry y'all, I'm just commenting bits by bits: > > "One key problem is a lack of orthogonality with other features, for instance creating a masked array with physical quantities can't be done because both are separate subclasses of ndarray. The only reasonable way to deal with this is to move the mask into the core ndarray." > > Meh. I did try to make it easy to use masked arrays on top of subclasses. There's even some tests in the suite to that effect (test_subclassing). I'm not buying the argument. > About moving mask in the core ndarray: I had suggested back in the days to have a mask flag/property built-in ndarrays (which would *really* have simplified the game), but this suggestion was dismissed very quickly as adding too much overload. I had to agree. I'm just a tad surprised the wind has changed on that matter. > > Ok, I'll have to change that section then. :) > > I don't remember seeing mention of this ability in the documentation, but I may not have been reading closely enough for that part. 
Or played with it ;) > > "In the current masked array, calculations are done for the whole array, then masks are patched up afterwords. This means that invalid calculations sitting in masked elements can raise warnings or exceptions even though they shouldn't, so the ufunc error handling mechanism can't be relied on." > > Well, there's a reason for that. Initially, I tried to guess what the mask of the output should be from the mask of the inputs, the objective being to avoid getting NaNs in the C array. That was easy in most cases, but it turned out it wasn't always possible (the `power` one caused me a lot of issues, if I recall correctly). So, for performance issues (to avoid a lot of expensive tests), I fell back on the old concept of "compute them all, they'll be sorted afterwards". > Of course, that's rather clumsy an approach. But it works not too badly when in pure Python. No doubt that a proper C implementation would work faster. > Oh, about using NaNs for invalid data ? Well, can't work with integers. > > In my proposal, NaNs stay as unmasked NaN values, instead of turning into masked values. This is necessary for uniform treatment of all dtypes, but a subclass could override this behavior with an extra mask modification after arithmetic operations. No problem with that... > `mask` property: > Nothing to add to it. It's basically what we have now (except for the opposite convention). > > Working with masked values: > I recall some strong points back in the days for not using None to represent missing values... > Adding a maskedstr argument to array2string ? Mmh... I prefer a global flag like we have now. > > I'm not really a fan of all the global state that NumPy keeps, I guess I'm trying to stamp that out bit by bit as well where I can... Pretty convenient to define a default once for all, though. > Design questions: > Adding `masked` or whatever we call it to a number/array should result is masked/a fully masked array, period. That way, we can have an idea that something was wrong with the initial dataset. > > I'm not sure I understand what you mean, in the design adding a mask means setting "a.mask = True", "a.mask = False", or "a.mask = " in general. I mean that: 0 + ma.masked = ma.masked ma.array([1,2,3], mask=False) + ma.masked = ma.array([1,2,3], mask=[True,True,True]) By extension, any operation involving a masked value should result in a masked value. > hardmask: I never used the feature myself. I wonder if anyone did. Still, it's a nice idea... > > Ok, I'll leave that out of the initial design unless someone comes up with some strong use cases. Oh, it doesn't eat bread (as we say in French), so you can leave it where it is... From pgmdevlist at gmail.com Thu Jun 23 21:09:42 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 03:09:42 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Jun 24, 2011, at 2:56 AM, Benjamin Root wrote: > > As a heavy masked_array user, I regret not being able to participate more in this discussion as I am madly cranking out matplotlib code. I would like to say that I have always seen masked arrays as being the "next step up" from using arrays with NaNs. The hardmask/softmask/sharedmasked concepts are powerful, and I don't think they have yet to be exploited to their fullest potential. Sweet, a user ! 
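Pierre's propagation rule from a few paragraphs up (0 + ma.masked gives masked, and an array plus ma.masked masks every element) can be checked against today's numpy.ma with a couple of lines:

    import numpy as np

    print(0 + np.ma.masked)                  # the masked constant propagates
    a = np.ma.array([1, 2, 3], mask=False)
    print(a + np.ma.masked)                  # every element of the result is masked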
> > Also, just to make things messier, there is an incomplete feature that was made for record arrays with regards to masking. The idea was to allow for element-by-element masking, but also allow for row-by-row (or was it column-by-column?) masking. I thought it was a neat feature, and it is too bad that it was not finished. You can still do that with a structured masked array: you can mask one element, or a whole record, or a whole field. >>> x=np.ma.array([(1,1.),(2,2.),(3,3.)], dtype=[('a',int),('b',float)]) >>> x.mask=[1,0,0] >>>x['b'] = np.ma.masked True, I never bothered porting it to MaskedRecords, because I don't really see the point of records in the first place... > Anyway, my opinion is that a mask should be True for a value that needs to be hidden. Do not change this convention. People coming into python already has to change code, a simple bit flip for them should be fine. Breaking existing python code is worse. +1 > I also don't see it as entirely necessary for *all* of masked arrays to be brought into numpy core. Only the most important parts/hooks need to be. We could then still have a masked array class that provides the finishing touches such as the sharing of masks and special masked related functions. +1 > > Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs woulldn't work anyway with integer arrays (and object arrays, etc.). > > Back to toiling on matplotlib, G'luck w/ that :) P. From njs at pobox.com Thu Jun 23 21:32:00 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 23 Jun 2011 18:32:00 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith wrote: >> It's should also be possible to accomplish a general solution at the >> dtype level. We could have a 'dtype factory' used like: >> ?np.zeros(10, dtype=np.maybe(float)) >> where np.maybe(x) returns a new dtype whose storage size is x.itemsize >> + 1, where the extra byte is used to store missingness information. >> (There might be some annoying alignment issues to deal with.) Then for >> each ufunc we define a handler for the maybe dtype (or add a >> special-case to the ufunc dispatch machinery) that checks the >> missingness value and then dispatches to the ordinary ufunc handler >> for the wrapped dtype. > > The 'dtype factory' idea builds on the way I've structured datetime as a > parameterized type, but the thing that kills it for me is the alignment > problems of 'x.itemsize + 1'. Having the mask in a separate memory block is > a lot better than having to store 16 bytes for an 8-byte int to preserve the > alignment. Hmm. I'm not convinced that this is the best approach either, but let me play devil's advocate. The disadvantage of this approach is that masked arrays would effectively have a 100% memory overhead over regular arrays, as opposed to the "shadow mask" approach where the memory overhead is 12.5%--100% depending on the size of objects being stored. Probably the most common case is arrays of doubles, in which case it's 100% versus 12.5%. So that sucks. But on the other hand, we gain: -- simpler implementation: no need to be checking and tracking the mask buffer everywhere. The needed infrastructure is already built in. 
-- simpler conceptually: we already have the dtype concept, it's a very powerful and we use it for all sorts of things; using it here too plays to our strengths. We already know what a numpy scalar is and how it works. Everyone already understands how assigning a value to an element of an array works, how it interacts with broadcasting, etc., etc., and in this model, that's all a missing value is -- just another value. -- it composes better with existing functionality: for example, someone mentioned the distinction between a missing field inside a record versus a missing record. In this model, that would just be the difference between dtype([("x", maybe(float))]) and maybe(dtype([("x", float)])). Optimization is important and all, but so is simplicity and robustness. That's why we're using Python in the first place :-). If we think that the memory overhead for floating point types is too high, it would be easy to add a special case where maybe(float) used a distinguished NaN instead of a separate boolean. The extra complexity would be isolated to the 'maybe' dtype's inner loop functions, and transparent to the Python level. (Implementing a similar optimization for the masking approach would be really nasty.) This would change the overhead comparison to 0% versus 12.5% in favor of the dtype approach. -- Nathaniel From njs at pobox.com Thu Jun 23 21:41:38 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 23 Jun 2011 18:41:38 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: > Lastly, I am not entirely familiar with R, so I am also very curious about > what this magical "NA" value is, and how it compares to how NaNs work. > Although, Pierre brought up the very good point that NaNs woulldn't work > anyway with integer arrays (and object arrays, etc.). Since R is designed for statistics, they made the interesting decision that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast a NA of one type into NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless"). That's basically it. -- Nathaniel From ben.root at ou.edu Thu Jun 23 21:52:32 2011 From: ben.root at ou.edu (Benjamin Root) Date: Thu, 23 Jun 2011 20:52:32 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Thu, Jun 23, 2011 at 8:41 PM, Nathaniel Smith wrote: > Internally, there are > a bunch of different NA values -- for floats it's a particular NaN, > for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. So, Bender runs on R? (Sorry, couldn't resist. 
Futurama season premiere is tonight) Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From srean.list at gmail.com Fri Jun 24 03:35:15 2011 From: srean.list at gmail.com (srean) Date: Fri, 24 Jun 2011 02:35:15 -0500 Subject: [Numpy-discussion] How to avoid extra copying when forming an array from an iterator Message-ID: Hi, I have a iterator that yields a complex object. I want to make an array out of a numerical attribute that the yielded object possesses and that too very efficiently. My initial plan was to keep writing the numbers to a StringIO object and when done generate the numpy array using StringIO's buffer. But fromstring() method fails on the StringIO object, so does fromfile() or using the StringIO object as the initializer of an array object. After some digging I ran in to this ticket htt projects.scipy.org/numpy/ticket/1634 that has been assigned a low priority. Is there some other way to achieve what I am trying ? Efficiency is important because potentially millions of objects would be yielded. -- srean -------------- next part -------------- An HTML attachment was scrubbed... URL: From srean.list at gmail.com Fri Jun 24 05:03:01 2011 From: srean.list at gmail.com (srean) Date: Fri, 24 Jun 2011 04:03:01 -0500 Subject: [Numpy-discussion] How to avoid extra copying when forming an array from an iterator In-Reply-To: References: Message-ID: To answer my own question, I guess I can keep appending to a array.array() object and get a numpy.array from its buffer if possible. Is that the efficient way. On Fri, Jun 24, 2011 at 2:35 AM, srean wrote: > Hi, > > I have an iterator that yields a complex object. I want to make an array > out of a numerical attribute that the yielded object possesses and that too > very efficiently. > > My initial plan was to keep writing the numbers to a StringIO object and > when done generate the numpy array using StringIO's buffer. But fromstring() > method fails on the StringIO object, so does fromfile() or using the > StringIO object as the initializer of an array object. > > After some digging I ran in to this ticket htt > projects.scipy.org/numpy/ticket/1634 that has been assigned a low > priority. > > Is there some other way to achieve what I am trying ? Efficiency is > important because potentially millions of objects would be yielded. > > -- srean > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Fri Jun 24 07:29:29 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Fri, 24 Jun 2011 07:29:29 -0400 Subject: [Numpy-discussion] A bit of advice? References: Message-ID: josef.pktd at gmail.com wrote: > On Thu, Jun 23, 2011 at 8:20 AM, Neal Becker wrote: >> Olivier Delalleau wrote: >> >>> What about : >>> dict((k, [e for e in arr if (e['x0'], e['x1']) == k]) for k in cases) >>> ? >> >> Not bad! Thanks! >> >> BTW, is there an easier way to get the unique keys, then this: >> >> cases = tuple (set (tuple((e['a'],e['b'])) for e in u)) > > I think you can just combine these 2 > > experiments = defaultdict([]) #syntax ? 
> > for i, e in enumerate(arr): > experiments[tuple((e['a'],e['b']))].append(i) > #experiments[tuple((e['a'],e['b']))].append(y['c']) #or just > summarize results > > experiments.keys() #uniques > > (just typed not checked) > > Josef > Yes, thanks, this works quite nicely: experiments = defaultdict(list) for e in arr: experiments[tuple((e['a'],e['b']))].append(e) From ndbecker2 at gmail.com Fri Jun 24 07:46:10 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Fri, 24 Jun 2011 07:46:10 -0400 Subject: [Numpy-discussion] Pleasantly surprised by recarray Message-ID: I was pleasantly surprised to find that recarrays can have a recursive structure, and can be defined using a nice syntax: dtype=[('deltaf', float),('eq', bool),('filt', [('pulse', 'a10'), ('alpha', float)])] However, this I discovered just by guessing. It would be good to mention this in the reference doc. BTW, one unfortunate side effect of using recarrays with elements that are not simple types is that identity comparison seems to fail. That is, when addressing my previous problem for e in arr: experiments[tuple((e['a'],e['b']))].append(e) That would apparantly need to be: for e in array: experiments[e['deltaf'], tuple(e['filt'])].append(e) Unless e['filt'] is converted to tuple, it's type is: type(cases[0][4]) Out[55]: which does not seem to identity compare as I would want. From matthew.brett at gmail.com Fri Jun 24 07:47:45 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 24 Jun 2011 12:47:45 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: Hi, On Thu, Jun 23, 2011 at 10:44 PM, Robert Kern wrote: > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: >> Enthought has asked me to look into the "missing data" problem and how NumPy >> could treat it better. I've considered the different ideas of adding dtype >> variants with a special signal value and masked arrays, and concluded that >> adding masks to the core ndarray appears is the best way to deal with the >> problem in general. >> I've written a NEP that proposes a particular design, viewable here: >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> There are some questions at the bottom of the NEP which definitely need >> discussion to find the best design choices. Please read, and let me know of >> all the errors and gaps you find in the document. > > One thing that could use more explanation is how your proposal > improves on the status quo, i.e. numpy.ma. As far as I can see, you > are mostly just shuffling around the functionality that already > exists. There has been a continual desire for something like R's NA > values by people who are very familiar with both R and numpy's masked > arrays. Both have their uses, and as Nathaniel points out, R's > approach seems to be very well-liked by a lot of users. In essence, > *that's* the "missing data problem" that you were charged with: making > happy the users who are currently dissatisfied with masked arrays. It > doesn't seem to me that moving the functionality from numpy.ma to > numpy.ndarray resolves any of their issues. Maybe it would help if you could say specifically which issues you think are not being addressed? Or was this more in the way of a 'please speak up'? 
Cheers, Matthew From matthew.brett at gmail.com Fri Jun 24 07:59:49 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 24 Jun 2011 12:59:49 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: Hi, On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith wrote: ... > If we think that the memory overhead for floating point types is too > high, it would be easy to add a special case where maybe(float) used a > distinguished NaN instead of a separate boolean. The extra complexity > would be isolated to the 'maybe' dtype's inner loop functions, and > transparent to the Python level. (Implementing a similar optimization > for the masking approach would be really nasty.) This would change the > overhead comparison to 0% versus 12.5% in favor of the dtype approach. Can I take this chance to ask Mark a bit more about the problems he sees for the dtypes with missing values? That is have a np.float64_with_missing np.int32_with_missing type dtypes. I see in your NEP you say 'The trouble with this approach is that it requires a large amount of special case code in each data type, and writing a new data type supporting missing data requires defining a mechanism for a special signal value which may not be possible in general.' Just to be clear, you are saying that that, for each dtype, there needs to be some code doing: missing_value = dtype.missing_value then, in loops: if val[here] == missing_value: do_something() and the fact that 'missing_value' could be any type would make the code more complicated than the current case where the mask is always bools or something? Nathaniel's point about reduction in storage needed for the mask to 0 is surely significant if we want numpy to be the best choice for big data. You mention that it would be good to allow masking for any new dtype - is that a practical problem? I mean, how many people will in fact have the combination of a) need of masking b) need of custom dtype, and c) lack of time or expertise to implement masking for that type? Thanks a lot for the proposal and the discussion, Matthew From lgautier at gmail.com Fri Jun 24 08:30:07 2011 From: lgautier at gmail.com (Laurent Gautier) Date: Fri, 24 Jun 2011 14:30:07 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4E04834F.6060708@gmail.com> On 2011-06-24 13:59, Nathaniel Smith wrote: > On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: >> Lastly, I am not entirely familiar with R, so I am also very curious about >> what this magical "NA" value is, and how it compares to how NaNs work. >> Although, Pierre brought up the very good point that NaNs woulldn't work >> anyway with integer arrays (and object arrays, etc.). > Since R is designed for statistics, they made the interesting decision > that *all* of their core types have a special designated "missing" > value. At the R level this is just called "NA". Internally, there are > a bunch of different NA values -- for floats it's a particular NaN, > for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never > notice this, because R will silently cast a NA of one type into NA of > another type whenever needed, and they all print the same.) > > Because any array can contain NA's, all R functions then have to have > some way of handling this -- all their integer arithmetic knows that > INT_MIN is special, for instance. 
The rules are basically the same as > for NaN's, but NA and NaN are different from each other (because one > means "I don't know, could be anything" and the other means "you tried > to divide by 0, I *know* that's meaningless"). > > That's basically it. > > -- Nathaniel Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ? Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell. If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust. L. PS: In R, dividing one by zero returns +/-Inf, not NaN. 0/0 returns NaN. From ndbecker2 at gmail.com Fri Jun 24 09:01:22 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Fri, 24 Jun 2011 09:01:22 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray References: Message-ID: Just 1 question before I look more closely. What is the cost to the non-MA user of this addition? From charlesr.harris at gmail.com Fri Jun 24 09:33:02 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 07:33:02 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04834F.6060708@gmail.com> References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 6:30 AM, Laurent Gautier wrote: > On 2011-06-24 13:59, Nathaniel Smith wrote: > > On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: > >> Lastly, I am not entirely familiar with R, so I am also very curious > about > >> what this magical "NA" value is, and how it compares to how NaNs work. > >> Although, Pierre brought up the very good point that NaNs woulldn't work > >> anyway with integer arrays (and object arrays, etc.). > > Since R is designed for statistics, they made the interesting decision > > that *all* of their core types have a special designated "missing" > > value. At the R level this is just called "NA". Internally, there are > > a bunch of different NA values -- for floats it's a particular NaN, > > for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never > > notice this, because R will silently cast a NA of one type into NA of > > another type whenever needed, and they all print the same.) > > > > Because any array can contain NA's, all R functions then have to have > > some way of handling this -- all their integer arithmetic knows that > > INT_MIN is special, for instance. The rules are basically the same as > > for NaN's, but NA and NaN are different from each other (because one > > means "I don't know, could be anything" and the other means "you tried > > to divide by 0, I *know* that's meaningless"). > > > > That's basically it. > > > > -- Nathaniel > > Would the use of R's system for expressing "missing values" be possible > in numpy through a special flag ? > > Any given numpy array could have a boolean flag (say "na_aware") > indicating that some of the values are representing a missing cell. > > If the exact same system is used, interaction with R (through something > like rpy2) would be simplified and more robust. > Interesting thought. Doing that could be handled by adding r-dtypes, as in r-float32, r-int, etc. However, adding so many dtypes with different behaviors could make for a messy implementation, whereas masks would be uniform across types. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kwgoodman at gmail.com Fri Jun 24 09:57:04 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Fri, 24 Jun 2011 06:57:04 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 3:24 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman wrote: >> >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: >> > Enthought has asked me to look into the "missing data" problem and how >> > NumPy >> > could treat it better. I've considered the different ideas of adding >> > dtype >> > variants with a special signal value and masked arrays, and concluded >> > that >> > adding masks to the core ndarray appears is the best way to deal with >> > the >> > problem in general. >> > I've written a NEP that proposes a particular design, viewable here: >> > >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >> > There are some questions at the bottom of the NEP which definitely need >> > discussion to find the best design choices. Please read, and let me know >> > of >> > all the errors and gaps you find in the document. >> >> Wow, that is exciting. >> >> I wonder about the relative performance of the two possible >> implementations (mask and NA) in the PEP. > > I've given that some thought, and I don't think there's a clear way to tell > what the performance gap would be without implementations of both to > benchmark against each other. I favor the mask primarily because it provides > masking for all data types in one go with a single consistent interface to > program against. For adding NA signal values, each new data type would need > a lot of work to gain the same level of support. >> >> If you are, say, doing a calculation along the columns of a 2d array >> one element at a time, then you will need to grab an element from the >> array and grab the corresponding element from the mask. I assume the >> corresponding data and mask elements are not stored together. That >> would be slow since memory access is usually were time is spent. In >> this regard NA would be faster. > > Yes, the masks add more memory traffic and some extra calculation, while the > NA signal values just require some additional calculations. I guess a better example would have been summing along rows instead of columns of a large C order array. If one needs to look at both the data and the mask then wouldn't summing along rows in cython be about as slow as it is currently to sum along columns? >> I currently use NaN as a missing data marker. That adds things like >> this to my cython code: >> >> ? ?if a[i] == a[i]: >> ? ? ? ?asum += a[i] >> >> If NA also had the property NA == NA is False, then it would be easy >> to use. > > That's what I believe it should do, and I guess this is a strike against the > idea of returning None for a single missing value. If NA == NA is False then I wouldn't need to look at the mask in the example above. Or would ndarray have to look at the mask in order to return NA for a[i]? Which would mean __getitem__ would need to look at the mask? If the missing value is returned as a 0d array (so that NA == NA is False), would that break cython in a fundamental way since it could not always return a same-sized scalar when you index into an array? >> A mask, on the other hand, would be more difficult for third >> party packages to support. 
You have to check if the mask is present >> and if so do a mask-aware calculation; if is it not present then you >> have to do a non-mask based calculation. > > I actually see the mask as being easier for third party packages to support, > particularly from C. Having regular C-friendly values with a boolean mask is > a lot friendlier than values that require a lot of special casing like the > NA signal values would require. > >> >> So you have two code paths. >> You also need to check if any of the input arrays have masks and if so >> apply masks to the other inputs, etc. > > Most of the time, the masks will transparently propagate or not along with > the arrays, with no effort required. In Python, the code you write would be > virtually the same between the two approaches. > -Mark > >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From robert.kern at gmail.com Fri Jun 24 10:01:15 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 09:01:15 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 06:47, Matthew Brett wrote: > Hi, > > On Thu, Jun 23, 2011 at 10:44 PM, Robert Kern wrote: >> On Thu, Jun 23, 2011 at 15:53, Mark Wiebe wrote: >>> Enthought has asked me to look into the "missing data" problem and how NumPy >>> could treat it better. I've considered the different ideas of adding dtype >>> variants with a special signal value and masked arrays, and concluded that >>> adding masks to the core ndarray appears is the best way to deal with the >>> problem in general. >>> I've written a NEP that proposes a particular design, viewable here: >>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst >>> There are some questions at the bottom of the NEP which definitely need >>> discussion to find the best design choices. Please read, and let me know of >>> all the errors and gaps you find in the document. >> >> One thing that could use more explanation is how your proposal >> improves on the status quo, i.e. numpy.ma. As far as I can see, you >> are mostly just shuffling around the functionality that already >> exists. There has been a continual desire for something like R's NA >> values by people who are very familiar with both R and numpy's masked >> arrays. Both have their uses, and as Nathaniel points out, R's >> approach seems to be very well-liked by a lot of users. In essence, >> *that's* the "missing data problem" that you were charged with: making >> happy the users who are currently dissatisfied with masked arrays. It >> doesn't seem to me that moving the functionality from numpy.ma to >> numpy.ndarray resolves any of their issues. > > Maybe it would help if you could say specifically which issues you > think are not being addressed? ?Or was this more in the way of a > 'please speak up'? More the latter. Any proposal that purports to replace numpy.ma ought to at least *mention* it. I think it's a fine proposal for a system de novo, but it's not de novo. 
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Fri Jun 24 10:06:30 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 09:06:30 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04834F.6060708@gmail.com> References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 07:30, Laurent Gautier wrote: > On 2011-06-24 13:59, ?Nathaniel Smith wrote: >> On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root ?wrote: >>> Lastly, I am not entirely familiar with R, so I am also very curious about >>> what this magical "NA" value is, and how it compares to how NaNs work. >>> Although, Pierre brought up the very good point that NaNs woulldn't work >>> anyway with integer arrays (and object arrays, etc.). >> Since R is designed for statistics, they made the interesting decision >> that *all* of their core types have a special designated "missing" >> value. At the R level this is just called "NA". Internally, there are >> a bunch of different NA values -- for floats it's a particular NaN, >> for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never >> notice this, because R will silently cast a NA of one type into NA of >> another type whenever needed, and they all print the same.) >> >> Because any array can contain NA's, all R functions then have to have >> some way of handling this -- all their integer arithmetic knows that >> INT_MIN is special, for instance. The rules are basically the same as >> for NaN's, but NA and NaN are different from each other (because one >> means "I don't know, could be anything" and the other means "you tried >> to divide by 0, I *know* that's meaningless"). >> >> That's basically it. >> >> -- Nathaniel > > Would the use of R's system for expressing "missing values" be possible > in numpy through a special flag ? > > Any given numpy array could have a boolean flag (say "na_aware") > indicating that some of the values are representing a missing cell. > > If the exact same system is used, interaction with R (through something > like rpy2) would be simplified and more robust. The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Fri Jun 24 10:12:18 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 09:12:18 -0500 Subject: [Numpy-discussion] How to avoid extra copying when forming an array from an iterator In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 04:03, srean wrote: > To answer my own question, I guess I can keep appending to a array.array() > object and get a numpy.array from its buffer if possible.? Is that the > efficient way. It's one of the most efficient ways to do it, yes, especially for 1D arrays. 
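Spelled out, that pattern looks like the sketch below; Sample and the generator are only stand-ins for the original poster's complex objects and iterator:

    import array
    from collections import namedtuple

    import numpy as np

    Sample = namedtuple('Sample', 'val')               # toy stand-in for the complex object
    source = (Sample(float(i)) for i in range(1000))   # toy stand-in for the real iterator

    buf = array.array('d')                             # growable, typed buffer of C doubles
    for obj in source:
        buf.append(obj.val)                            # pull out only the numeric attribute

    a = np.frombuffer(buf, dtype=np.float64)           # shares array.array's memory, no copy
    print(a.shape, a[:3])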
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From kwgoodman at gmail.com Fri Jun 24 10:24:04 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Fri, 24 Jun 2011 07:24:04 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern wrote: > The alternative proposal would be to add a few new dtypes that are > NA-aware. E.g. an nafloat64 would reserve a particular NaN value > (there are lots of different NaN bit patterns, we'd just reserve one) > that would represent NA. An naint32 would probably reserve the most > negative int32 value (like R does). Using the NA-aware dtypes signals > that you are using NA values; there is no need for an additional flag. I don't understand the numpy design and maintainable issues, but from a user perspective (mine) nafloat64, etc sounds nice. From bsouthey at gmail.com Fri Jun 24 10:27:55 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 24 Jun 2011 09:27:55 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: <4E049EEB.1060203@gmail.com> On 06/24/2011 09:06 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 07:30, Laurent Gautier wrote: >> On 2011-06-24 13:59, Nathaniel Smith wrote: >>> On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: >>>> Lastly, I am not entirely familiar with R, so I am also very curious about >>>> what this magical "NA" value is, and how it compares to how NaNs work. >>>> Although, Pierre brought up the very good point that NaNs woulldn't work >>>> anyway with integer arrays (and object arrays, etc.). >>> Since R is designed for statistics, they made the interesting decision >>> that *all* of their core types have a special designated "missing" >>> value. At the R level this is just called "NA". Internally, there are >>> a bunch of different NA values -- for floats it's a particular NaN, >>> for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never >>> notice this, because R will silently cast a NA of one type into NA of >>> another type whenever needed, and they all print the same.) >>> >>> Because any array can contain NA's, all R functions then have to have >>> some way of handling this -- all their integer arithmetic knows that >>> INT_MIN is special, for instance. The rules are basically the same as >>> for NaN's, but NA and NaN are different from each other (because one >>> means "I don't know, could be anything" and the other means "you tried >>> to divide by 0, I *know* that's meaningless"). >>> >>> That's basically it. >>> >>> -- Nathaniel >> Would the use of R's system for expressing "missing values" be possible >> in numpy through a special flag ? >> >> Any given numpy array could have a boolean flag (say "na_aware") >> indicating that some of the values are representing a missing cell. >> >> If the exact same system is used, interaction with R (through something >> like rpy2) would be simplified and more robust. > The alternative proposal would be to add a few new dtypes that are > NA-aware. E.g. an nafloat64 would reserve a particular NaN value > (there are lots of different NaN bit patterns, we'd just reserve one) > that would represent NA. 
An naint32 would probably reserve the most > negative int32 value (like R does). Using the NA-aware dtypes signals > that you are using NA values; there is no need for an additional flag. > There is a very important distinction here between a masked value and a missing value. In some sense, a missing value is a permanently masked value and may be indistinguishable from 'not a number'. But a masked value is not a missing value because with masked arrays the original values still exist. Consequently using a masked value for 'missing-ness' can be reversed by the user changing the mask at any time. That is really the power of masked arrays: you can have real missing values but also 'flag' unusual values as missing! Virtually all software packages handle missing values, not masked values. So it is really, really important that you clarify what you are proposing because your proposal does mix these two different concepts. As per the missing value discussion, I would think that adding a missing value data type(s) for 'missing values' would be feasible and may be something that numpy should have. But that would not address 'masked values' and probably must be viewed as an independent topic and thread. Below are some sources for missing values in R and SAS. SAS has 28 ways that a user can define numerical values as 'missing values' - not just the dot! While not apparently universal, SAS has missing value codes to handle positive and negative infinity. R does distinguish between missing values and 'not a number' which, to my knowledge, SAS does not do. This distinction is probably important for masked vs missing values. SAS uses a blank for missing character values, but see the two links at the end for more on that. This is for R: http://faculty.nps.edu/sebuttre/home/S/missings.html http://www.ats.ucla.edu/stat/r/faq/missing.htm http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001292604.htm This page is a comparison to R: http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer.htm#imlug_r_sect019.htm Some other SAS sources: Malachy J. Foley "MISSING VALUES: Everything You Ever Wanted to Know" http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF 072-2011: Special Missing Values for Character Fields - SAS support.*sas*.com/resources/papers/proceedings11/072-2011.pdf Bruce -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 24 10:33:18 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 08:33:18 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 07:30, Laurent Gautier wrote: > > On 2011-06-24 13:59, Nathaniel Smith wrote: > >> On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: > >>> Lastly, I am not entirely familiar with R, so I am also very curious > about > >>> what this magical "NA" value is, and how it compares to how NaNs work. > >>> Although, Pierre brought up the very good point that NaNs wouldn't > work > >>> anyway with integer arrays (and object arrays, etc.). > >> Since R is designed for statistics, they made the interesting decision > >> that *all* of their core types have a special designated "missing" > >> value. At the R level this is just called "NA".
Internally, there are > >> a bunch of different NA values -- for floats it's a particular NaN, > >> for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never > >> notice this, because R will silently cast a NA of one type into NA of > >> another type whenever needed, and they all print the same.) > >> > >> Because any array can contain NA's, all R functions then have to have > >> some way of handling this -- all their integer arithmetic knows that > >> INT_MIN is special, for instance. The rules are basically the same as > >> for NaN's, but NA and NaN are different from each other (because one > >> means "I don't know, could be anything" and the other means "you tried > >> to divide by 0, I *know* that's meaningless"). > >> > >> That's basically it. > >> > >> -- Nathaniel > > > > Would the use of R's system for expressing "missing values" be possible > > in numpy through a special flag ? > > > > Any given numpy array could have a boolean flag (say "na_aware") > > indicating that some of the values are representing a missing cell. > > > > If the exact same system is used, interaction with R (through something > > like rpy2) would be simplified and more robust. > > The alternative proposal would be to add a few new dtypes that are > NA-aware. E.g. an nafloat64 would reserve a particular NaN value > (there are lots of different NaN bit patterns, we'd just reserve one) > that would represent NA. An naint32 would probably reserve the most > negative int32 value (like R does). Using the NA-aware dtypes signals > that you are using NA values; there is no need for an additional flag. > > Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Fri Jun 24 10:35:53 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 09:35:53 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 09:24, Keith Goodman wrote: > On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern wrote: > >> The alternative proposal would be to add a few new dtypes that are >> NA-aware. E.g. an nafloat64 would reserve a particular NaN value >> (there are lots of different NaN bit patterns, we'd just reserve one) >> that would represent NA. An naint32 would probably reserve the most >> negative int32 value (like R does). Using the NA-aware dtypes signals >> that you are using NA values; there is no need for an additional flag. > > I don't understand the numpy design and maintainable issues, but from > a user perspective (mine) nafloat64, etc sounds nice. It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason; e.g. where you're tempted to (ab)use NaNs for the purpose and the limitations on the range of values is acceptable. Not every dtype would have an NA-aware counterpart. I would suggest just nabool, nafloat64, naint32, nastring (a little tricky due to the flexible size, but doable), and naobject. Maybe a couple more, if we get requests, like naint64 and nacomplex128. 
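None of the na* dtypes listed above exist; as a rough sketch of the sentinel idea using plain NumPy, the integer case would look something like this, with INT_MIN standing in for NA as in R:

    import numpy as np

    INT32_NA = np.iinfo(np.int32).min      # R's choice of NA for 32-bit integers

    data = np.array([3, INT32_NA, 7, 1], dtype=np.int32)
    is_na = (data == INT32_NA)             # NA-ness is recovered from the value itself
    print(data[~is_na].mean())             # mean over the non-NA entries only

An nafloat64 would work the same way, except that the sentinel is a reserved NaN bit pattern rather than an ordinary representable value.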
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Fri Jun 24 10:43:20 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 09:43:20 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 09:33, Charles R Harris wrote: > > On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern wrote: >> The alternative proposal would be to add a few new dtypes that are >> NA-aware. E.g. an nafloat64 would reserve a particular NaN value >> (there are lots of different NaN bit patterns, we'd just reserve one) >> that would represent NA. An naint32 would probably reserve the most >> negative int32 value (like R does). Using the NA-aware dtypes signals >> that you are using NA values; there is no need for an additional flag. > > Definitely better names than r-int32. Going this way has the advantage of > reducing the friction between R and numpy, and since R has pretty much > become the standard software for statistics that is an important > consideration. I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that.... -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Fri Jun 24 10:44:00 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 09:44:00 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 09:35, Robert Kern wrote: > On Fri, Jun 24, 2011 at 09:24, Keith Goodman wrote: >> On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern wrote: >> >>> The alternative proposal would be to add a few new dtypes that are >>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value >>> (there are lots of different NaN bit patterns, we'd just reserve one) >>> that would represent NA. An naint32 would probably reserve the most >>> negative int32 value (like R does). Using the NA-aware dtypes signals >>> that you are using NA values; there is no need for an additional flag. >> >> I don't understand the numpy design and maintainable issues, but from >> a user perspective (mine) nafloat64, etc sounds nice. > > It's worth noting that this is not a replacement for masked arrays, > nor is it intended to be the be-all, end-all solution to missing data > problems. It's mostly just intended to be a focused tool to fill in > the gaps where masked arrays are less convenient for whatever reason; > e.g. where you're tempted to (ab)use NaNs for the purpose and the > limitations on the range of values is acceptable. Not every dtype > would have an NA-aware counterpart. I would suggest just nabool, > nafloat64, naint32, nastring (a little tricky due to the flexible > size, but doable), and naobject. Maybe a couple more, if we get > requests, like naint64 and nacomplex128. Oh, and nadatetime64 and natimedelta64. 
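A sketch of what reserving one NaN bit pattern means in practice; the 1954 payload below echoes the value R is said to use for its float NA, but the exact pattern is only an illustration, not anything NumPy defines:

    import numpy as np

    # A quiet NaN with payload 1954 set aside as "NA"; np.nan keeps its usual bits.
    NA = np.array([0x7FF80000000007A2], dtype=np.uint64).view(np.float64)
    ordinary = np.array([np.nan])

    print(np.isnan(NA), np.isnan(ordinary))                # both are NaNs to the hardware
    print(NA.view(np.uint64) == ordinary.view(np.uint64))  # [False]: the bits tell them apart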
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From charlesr.harris at gmail.com Fri Jun 24 10:46:21 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 08:46:21 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:44 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 09:35, Robert Kern wrote: > > On Fri, Jun 24, 2011 at 09:24, Keith Goodman > wrote: > >> On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern > wrote: > >> > >>> The alternative proposal would be to add a few new dtypes that are > >>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value > >>> (there are lots of different NaN bit patterns, we'd just reserve one) > >>> that would represent NA. An naint32 would probably reserve the most > >>> negative int32 value (like R does). Using the NA-aware dtypes signals > >>> that you are using NA values; there is no need for an additional flag. > >> > >> I don't understand the numpy design and maintainable issues, but from > >> a user perspective (mine) nafloat64, etc sounds nice. > > > > It's worth noting that this is not a replacement for masked arrays, > > nor is it intended to be the be-all, end-all solution to missing data > > problems. It's mostly just intended to be a focused tool to fill in > > the gaps where masked arrays are less convenient for whatever reason; > > e.g. where you're tempted to (ab)use NaNs for the purpose and the > > limitations on the range of values is acceptable. Not every dtype > > would have an NA-aware counterpart. I would suggest just nabool, > > nafloat64, naint32, nastring (a little tricky due to the flexible > > size, but doable), and naobject. Maybe a couple more, if we get > > requests, like naint64 and nacomplex128. > > Oh, and nadatetime64 and natimedelta64. > > Beat me to it ;) Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgmdevlist at gmail.com Fri Jun 24 11:02:01 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Fri, 24 Jun 2011 17:02:01 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> On Jun 24, 2011, at 4:44 PM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 09:35, Robert Kern wrote: >> On Fri, Jun 24, 2011 at 09:24, Keith Goodman wrote: >>> On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern wrote: >>> >>>> The alternative proposal would be to add a few new dtypes that are >>>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value >>>> (there are lots of different NaN bit patterns, we'd just reserve one) >>>> that would represent NA. An naint32 would probably reserve the most >>>> negative int32 value (like R does). Using the NA-aware dtypes signals >>>> that you are using NA values; there is no need for an additional flag. >>> >>> I don't understand the numpy design and maintainable issues, but from >>> a user perspective (mine) nafloat64, etc sounds nice. >> >> It's worth noting that this is not a replacement for masked arrays, >> nor is it intended to be the be-all, end-all solution to missing data >> problems. 
It's mostly just intended to be a focused tool to fill in >> the gaps where masked arrays are less convenient for whatever reason; >> e.g. where you're tempted to (ab)use NaNs for the purpose and the >> limitations on the range of values is acceptable. Not every dtype >> would have an NA-aware counterpart. I would suggest just nabool, >> nafloat64, naint32, nastring (a little tricky due to the flexible >> size, but doable), and naobject. Maybe a couple more, if we get >> requests, like naint64 and nacomplex128. > > Oh, and nadatetime64 and natimedelta64. So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right ? And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values' ? That's pretty elegant. I like that. Now, how will masked values represented ? Different masked values from one dtype to another ? What would be the equivalent of something like `if a[0] is masked` that we have know? From lgautier at gmail.com Fri Jun 24 11:07:06 2011 From: lgautier at gmail.com (Laurent Gautier) Date: Fri, 24 Jun 2011 17:07:06 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4E04A81A.2080406@gmail.com> On 2011-06-24 16:43, Robert Kern wrote: > On Fri, Jun 24, 2011 at 09:33, Charles R Harris > wrote: >> > >> > On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern wrote: >>> >> The alternative proposal would be to add a few new dtypes that are >>> >> NA-aware. E.g. an nafloat64 would reserve a particular NaN value >>> >> (there are lots of different NaN bit patterns, we'd just reserve one) >>> >> that would represent NA. An naint32 would probably reserve the most >>> >> negative int32 value (like R does). Using the NA-aware dtypes signals >>> >> that you are using NA values; there is no need for an additional flag. >> > >> > Definitely better names than r-int32. Going this way has the advantage of >> > reducing the friction between R and numpy, and since R has pretty much >> > become the standard software for statistics that is an important >> > consideration. > I would definitely steal their choices of NA value for naint32 and > nafloat64. I have reservations about their string NA value (i.e. 'NA') > as anyone doing business in North America and other continents may > have issues with that.... May be there is not so much need for reservation over the string NA, when making the distinction between: a- the internal representation of a "missing string" (what is stored in memory, and that C-level code would need to be aware of) b- the 'external' representation of a missing string (in Python, what would be returned by repr() ) c- what is assumed to be a missing string value when reading from a file. a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can be configured as a module-level, class-level, or instance-level variable. From matthew.brett at gmail.com Fri Jun 24 11:07:55 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 24 Jun 2011 16:07:55 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: Hi, On Fri, Jun 24, 2011 at 3:43 PM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 09:33, Charles R Harris > wrote: >> >> On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern wrote: > >>> The alternative proposal would be to add a few new dtypes that are >>> NA-aware. E.g. 
an nafloat64 would reserve a particular NaN value >>> (there are lots of different NaN bit patterns, we'd just reserve one) >>> that would represent NA. An naint32 would probably reserve the most >>> negative int32 value (like R does). Using the NA-aware dtypes signals >>> that you are using NA values; there is no need for an additional flag. >> >> Definitely better names than r-int32. Going this way has the advantage of >> reducing the friction between R and numpy, and since R has pretty much >> become the standard software for statistics that is an important >> consideration. > > I would definitely steal their choices of NA value for naint32 and > nafloat64. I have reservations about their string NA value (i.e. 'NA') > as anyone doing business in North America and other continents may > have issues with that.... It would certainly help me at least if someone (Mark? sorry to ask...) could set out the implementation and API differences that would result from the two options: 1) array.mask option - an integer array of shape array.shape giving mask (True, False) values for each element 2) nafloat64 option - dtypes with specified dtype-specific missing values Best, Matthew From robert.kern at gmail.com Fri Jun 24 11:14:59 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 10:14:59 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04A81A.2080406@gmail.com> References: <4E04A81A.2080406@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:07, Laurent Gautier wrote: > On 2011-06-24 16:43, Robert Kern wrote: >> >> On Fri, Jun 24, 2011 at 09:33, Charles R Harris >> wrote: >>> >>> > >>> > ?On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern >>> > ?wrote: >>>> >>>> >> ?The alternative proposal would be to add a few new dtypes that are >>>> >> ?NA-aware. E.g. an nafloat64 would reserve a particular NaN value >>>> >> ?(there are lots of different NaN bit patterns, we'd just reserve >>>> >> one) >>>> >> ?that would represent NA. An naint32 would probably reserve the most >>>> >> ?negative int32 value (like R does). Using the NA-aware dtypes >>>> >> signals >>>> >> ?that you are using NA values; there is no need for an additional >>>> >> flag. >>> >>> > >>> > ?Definitely better names than r-int32. Going this way has the advantage >>> > of >>> > ?reducing the friction between R and numpy, and since R has pretty much >>> > ?become the standard software for statistics that is an important >>> > ?consideration. >> >> I would definitely steal their choices of NA value for naint32 and >> nafloat64. I have reservations about their string NA value (i.e. 'NA') >> as anyone doing business in North America and other continents may >> have issues with that.... > > May be there is not so much need for reservation over the string NA, when > making the distinction between: > a- the internal representation of a "missing string" (what is stored in > memory, and that C-level code would need to be aware of) > b- the 'external' representation of a missing string (in Python, what would > be returned by repr() ) > c- what is assumed to be a missing string value when reading from a file. > > a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can > be configured as a module-level, class-level, or instance-level variable. In R, a/ happens to be 'NA', unfortunately. :-/ I'm not really sure how they handle datasets that use valid 'NA' values. 
Presumably, their input routines allow one to convert such values to something else such that it can use 'NA'==NA internally. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Fri Jun 24 11:30:32 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 10:30:32 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> References: <4E04834F.6060708@gmail.com> <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:02, Pierre GM wrote: > > On Jun 24, 2011, at 4:44 PM, Robert Kern wrote: > >> On Fri, Jun 24, 2011 at 09:35, Robert Kern wrote: >>> On Fri, Jun 24, 2011 at 09:24, Keith Goodman wrote: >>>> On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern wrote: >>>> >>>>> The alternative proposal would be to add a few new dtypes that are >>>>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value >>>>> (there are lots of different NaN bit patterns, we'd just reserve one) >>>>> that would represent NA. An naint32 would probably reserve the most >>>>> negative int32 value (like R does). Using the NA-aware dtypes signals >>>>> that you are using NA values; there is no need for an additional flag. >>>> >>>> I don't understand the numpy design and maintainable issues, but from >>>> a user perspective (mine) nafloat64, etc sounds nice. >>> >>> It's worth noting that this is not a replacement for masked arrays, >>> nor is it intended to be the be-all, end-all solution to missing data >>> problems. It's mostly just intended to be a focused tool to fill in >>> the gaps where masked arrays are less convenient for whatever reason; >>> e.g. where you're tempted to (ab)use NaNs for the purpose and the >>> limitations on the range of values is acceptable. Not every dtype >>> would have an NA-aware counterpart. I would suggest just nabool, >>> nafloat64, naint32, nastring (a little tricky due to the flexible >>> size, but doable), and naobject. Maybe a couple more, if we get >>> requests, like naint64 and nacomplex128. >> >> Oh, and nadatetime64 and natimedelta64. > > So, if I understand correctly: > if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right ? Not quite; there are no separate mask arrays with this approach. It's more akin to using NaNs to represent missing values except with more rigor. NA values won't be "accidentally" created from computations from non-NA values. > And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values' ? That's pretty elegant. I like that. > Now, how will masked values represented ? For the float types, we use a particular NaN bit-pattern (we'll steal R's choice). For the int types, we use the most negative number. For strings, R uses 'NA', but I'd *like* to use something less likely to conflict with actual use. For the date/time types, we would reserve a value close to the NaT value. For objects, we would have a singleton created specifically for this purpose. bools, which are internally represented by a uint8, will use 2. > Different masked values from one dtype to another ? What would be the equivalent of something like `if a[0] is masked` that we have know? I would suggest following R's lead and letting ((NA==NA) == True) unlike NaNs. 
Each NA-aware scalar type would have a class attribute giving its NA value: if a[0] == nafloat64.NA: ... good_values = (a != nafloat64.NA) You could possibly make a general NA object with smart comparison methods that will inspect the dtype of the other object so you don't have to know the dtype in your code, but that's a little magic. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From mwwiebe at gmail.com Fri Jun 24 11:40:21 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 10:40:21 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root wrote: > On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: > >> Sorry y'all, I'm just commenting bits by bits: >> >> "One key problem is a lack of orthogonality with other features, for >> instance creating a masked array with physical quantities can't be done >> because both are separate subclasses of ndarray. The only reasonable way to >> deal with this is to move the mask into the core ndarray." >> >> Meh. I did try to make it easy to use masked arrays on top of subclasses. >> There's even some tests in the suite to that effect (test_subclassing). I'm >> not buying the argument. >> About moving mask in the core ndarray: I had suggested back in the days to >> have a mask flag/property built-in ndarrays (which would *really* have >> simplified the game), but this suggestion was dismissed very quickly as >> adding too much overload. I had to agree. I'm just a tad surprised the wind >> has changed on that matter. >> >> >> "In the current masked array, calculations are done for the whole array, >> then masks are patched up afterwords. This means that invalid calculations >> sitting in masked elements can raise warnings or exceptions even though they >> shouldn't, so the ufunc error handling mechanism can't be relied on." >> >> Well, there's a reason for that. Initially, I tried to guess what the mask >> of the output should be from the mask of the inputs, the objective being to >> avoid getting NaNs in the C array. That was easy in most cases, but it >> turned out it wasn't always possible (the `power` one caused me a lot of >> issues, if I recall correctly). So, for performance issues (to avoid a lot >> of expensive tests), I fell back on the old concept of "compute them all, >> they'll be sorted afterwards". >> Of course, that's rather clumsy an approach. But it works not too badly >> when in pure Python. No doubt that a proper C implementation would work >> faster. >> Oh, about using NaNs for invalid data ? Well, can't work with integers. >> >> `mask` property: >> Nothing to add to it. It's basically what we have now (except for the >> opposite convention). >> >> Working with masked values: >> I recall some strong points back in the days for not using None to >> represent missing values... >> Adding a maskedstr argument to array2string ? Mmh... I prefer a global >> flag like we have now. >> >> Design questions: >> Adding `masked` or whatever we call it to a number/array should result is >> masked/a fully masked array, period. That way, we can have an idea that >> something was wrong with the initial dataset. >> hardmask: I never used the feature myself. I wonder if anyone did. Still, >> it's a nice idea... 
>> > > As a heavy masked_array user, I regret not being able to participate more > in this discussion as I am madly cranking out matplotlib code. I would like > to say that I have always seen masked arrays as being the "next step up" > from using arrays with NaNs. The hardmask/softmask/sharedmasked concepts > are powerful, and I don't think they have yet to be exploited to their > fullest potential. > Do you have some examples where hardmask or sharedmask are being used? I like the idea of using a hardmask array as the return value for boolean indexing, but some more use cases would be nice. > Masks are (relatively) easy when dealing with element-by-element operations > that produces an array of the same shape (or at least the same number of > elements in the case of reshape and transpose). What gets difficult is for > reductions such as sum or max, etc. Then you get into the weirder cases > such as unwrap and gradients that I brought up recently. I am not sure how > to address this, but I am not a fan of the idea of adding yet another > parameter to the ufuncs to determine what to do for filling in a mask. > It looks like in R there is a parameter called na.rm=T/F, which basically means "remove NAs before doing the computation". This approach seems good to me for reduction operations. Also, just to make things messier, there is an incomplete feature that was > made for record arrays with regards to masking. The idea was to allow for > element-by-element masking, but also allow for row-by-row (or was it > column-by-column?) masking. I thought it was a neat feature, and it is too > bad that it was not finished. > I put this in my design, I think this would be useful too. I would call it field by field, though many people like thinking of the struct dtype fields as columns. Anyway, my opinion is that a mask should be True for a value that needs to > be hidden. Do not change this convention. People coming into python > already has to change code, a simple bit flip for them should be fine. > Breaking existing python code is worse. > I'm now thinking the mask needs to be pushed away into the background to where it becomes be an unimportant implementation detail of the system. It deserves a long cumbersome name like "validitymask", and then the system can use something close R's approach with an NA-like singleton for most operations. > I also don't see it as entirely necessary for *all* of masked arrays to be > brought into numpy core. Only the most important parts/hooks need to be. > We could then still have a masked array class that provides the finishing > touches such as the sharing of masks and special masked related functions. > That's reasonable, yeah. Lastly, I am not entirely familiar with R, so I am also very curious about > what this magical "NA" value is, and how it compares to how NaNs work. > Although, Pierre brought up the very good point that NaNs woulldn't work > anyway with integer arrays (and object arrays, etc.). > It's similar to NaN, but has the interpretation Nathanial pointed out, where there is a valid value but it's unknown. A consequence of that is that logical_and(, False) is False, something which behaves differently from NaNs. -Mark > > Back to toiling on matplotlib, > Ben Root > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Fri Jun 24 12:05:18 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 09:05:18 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04A81A.2080406@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:14 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 10:07, Laurent Gautier wrote: >> May be there is not so much need for reservation over the string NA, when >> making the distinction between: >> a- the internal representation of a "missing string" (what is stored in >> memory, and that C-level code would need to be aware of) >> b- the 'external' representation of a missing string (in Python, what would >> be returned by repr() ) >> c- what is assumed to be a missing string value when reading from a file. >> >> a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can >> be configured as a module-level, class-level, or instance-level variable. > > In R, a/ happens to be 'NA', unfortunately. :-/ > > I'm not really sure how they handle datasets that use valid 'NA' > values. Presumably, their input routines allow one to convert such > values to something else such that it can use 'NA'==NA internally. No, R can distinguish the string "NA" and the value NA-of-type-string: > c("NA", NA) [1] "NA" NA In R strings are represented as pointers, rather than in-place, and the magic NA value has a special globally known pointer value. (This pointer might well point to the characters "NA\0", but all of the code knows to check whether it has the magic NA pointer before actually following the pointer.) -- Nathaniel From robert.kern at gmail.com Fri Jun 24 12:07:22 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 11:07:22 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04A81A.2080406@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 11:05, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 8:14 AM, Robert Kern wrote: >> On Fri, Jun 24, 2011 at 10:07, Laurent Gautier wrote: >>> May be there is not so much need for reservation over the string NA, when >>> making the distinction between: >>> a- the internal representation of a "missing string" (what is stored in >>> memory, and that C-level code would need to be aware of) >>> b- the 'external' representation of a missing string (in Python, what would >>> be returned by repr() ) >>> c- what is assumed to be a missing string value when reading from a file. >>> >>> a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can >>> be configured as a module-level, class-level, or instance-level variable. >> >> In R, a/ happens to be 'NA', unfortunately. :-/ >> >> I'm not really sure how they handle datasets that use valid 'NA' >> values. Presumably, their input routines allow one to convert such >> values to something else such that it can use 'NA'==NA internally. > > No, R can distinguish the string "NA" and the value NA-of-type-string: > >> c("NA", NA) > [1] "NA" NA > > In R strings are represented as pointers, rather than in-place, and > the magic NA value has a special globally known pointer value. (This > pointer might well point to the characters "NA\0", but all of the code > knows to check whether it has the magic NA pointer before actually > following the pointer.) Ah, okay. Well, then we can pick whatever value we like. 
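The same trick can be sketched at the Python level with an object array: one globally known sentinel object marks a missing string without colliding with the ordinary string "NA". (The _NAType class below is purely hypothetical, not NumPy or R API.)

    import numpy as np

    class _NAType(object):
        """Hypothetical singleton playing the role of R's special NA pointer."""
        def __repr__(self):
            return "NA"

    NA = _NAType()                                  # one shared sentinel object

    a = np.array(["NA", NA, "spam"], dtype=object)
    print([x is NA for x in a])                     # [False, True, False]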
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From njs at pobox.com Fri Jun 24 12:11:40 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 09:11:40 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:30 AM, Robert Kern wrote: > I would suggest following R's lead and letting ((NA==NA) == True) > unlike NaNs. In R, NA and NaN do behave differently with respect to ==, but not the way you're saying: > NA == NA [1] NA > if (NA == NA) 1; Error in if (NA == NA) 1 : missing value where TRUE/FALSE needed This again is consistent with the semantics that NA represents some unknown concrete value -- depending on what the actual values are, NA == NA might or might not be true, we don't know. So it's NA as well. -- Nathaniel From Chris.Barker at noaa.gov Fri Jun 24 12:13:51 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Fri, 24 Jun 2011 09:13:51 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4E04B7BF.5030300@noaa.gov> Nathaniel Smith wrote: >> The 'dtype factory' idea builds on the way I've structured datetime as a >> parameterized type, ... Another disadvantage is that we get further from Gael Varoquaux's point: >> Right now, the numpy array can be seen as an extension of the C >> array, basically a pointer, a data type, and a shape (and strides). >> This enables easy sharing with libraries that have not been >> written with numpy in mind. and also PEP 3118 support It is very useful that a numpy array has a pointer to a regular old C array -- if we introduce this special dtype, that will break (well, not really, but then the C array would be of this particular struct). Granted, any other C code would probably have to do something with the mask anyway, but I still think it'd be better to keep that raw data array standard. This applies to switching between masked and not-masked numpy arrays also -- I don't think I'd want the performance hit of that requiring a data copy. Also the idea was posted here that you could use views to have the same data set with different masks -- that would break as well. Nathaniel Smith wrote: > If we think that the memory overhead for floating point types is too > high, it would be easy to add a special case where maybe(float) used a > distinguished NaN instead of a separate boolean. That would be pretty cool, though in the past folks have made a good argument that even for floats, masks have significant advantages over "just using NaN". One might be that you can mask and unmask a value for different operations, without losing the value. -Chris -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From mwwiebe at gmail.com Fri Jun 24 12:21:01 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 11:21:01 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <72008329-9B02-452B-B74F-91883FEF95E5@gmail.com> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <72008329-9B02-452B-B74F-91883FEF95E5@gmail.com> Message-ID: On Thu, Jun 23, 2011 at 8:00 PM, Pierre GM wrote: > > On Jun 24, 2011, at 2:42 AM, Mark Wiebe wrote: > > > On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: > > Sorry y'all, I'm just commenting bits by bits: > > > > "One key problem is a lack of orthogonality with other features, for > instance creating a masked array with physical quantities can't be done > because both are separate subclasses of ndarray. The only reasonable way to > deal with this is to move the mask into the core ndarray." > > > > Meh. I did try to make it easy to use masked arrays on top of subclasses. > There's even some tests in the suite to that effect (test_subclassing). I'm > not buying the argument. > > About moving mask in the core ndarray: I had suggested back in the days > to have a mask flag/property built-in ndarrays (which would *really* have > simplified the game), but this suggestion was dismissed very quickly as > adding too much overload. I had to agree. I'm just a tad surprised the wind > has changed on that matter. > > > > Ok, I'll have to change that section then. :) > > > > I don't remember seeing mention of this ability in the documentation, but > I may not have been reading closely enough for that part. > > Or played with it ;) True, I haven't played with it all that much, but the amount I've used it and the amount I've wrestled with it during 1.6 development certainly make me feel I know something about it. ;) > > > "In the current masked array, calculations are done for the whole array, > then masks are patched up afterwords. This means that invalid calculations > sitting in masked elements can raise warnings or exceptions even though they > shouldn't, so the ufunc error handling mechanism can't be relied on." > > > > Well, there's a reason for that. Initially, I tried to guess what the > mask of the output should be from the mask of the inputs, the objective > being to avoid getting NaNs in the C array. That was easy in most cases, > but it turned out it wasn't always possible (the `power` one caused me a > lot of issues, if I recall correctly). So, for performance issues (to avoid > a lot of expensive tests), I fell back on the old concept of "compute them > all, they'll be sorted afterwards". > > Of course, that's rather clumsy an approach. But it works not too badly > when in pure Python. No doubt that a proper C implementation would work > faster. > > Oh, about using NaNs for invalid data ? Well, can't work with integers. > > > > In my proposal, NaNs stay as unmasked NaN values, instead of turning into > masked values. This is necessary for uniform treatment of all dtypes, but a > subclass could override this behavior with an extra mask modification after > arithmetic operations. > > No problem with that... > > > > `mask` property: > > Nothing to add to it. It's basically what we have now (except for the > opposite convention). 
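The subclassing question can be poked at directly with a toy unit-carrying subclass (Quantity here is hypothetical, standing in for a physical-quantities class):

    import numpy as np
    import numpy.ma as ma

    class Quantity(np.ndarray):
        """Toy ndarray subclass standing in for a physical-quantities type."""
        def __new__(cls, data, unit=''):
            obj = np.asarray(data, dtype=float).view(cls)
            obj.unit = unit
            return obj
        def __array_finalize__(self, obj):
            self.unit = getattr(obj, 'unit', '')

    q = Quantity([1.0, 2.0, 3.0], unit='m')
    mq = ma.array(q, mask=[False, True, False])  # a masked array wrapped around the subclass
    print(mq.sum(), type(mq).__name__)           # 4.0, plus whatever class ma.array hands back

How much of the subclass's behaviour (the unit, here) survives that wrapping is exactly what is being disputed in this exchange.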
> > > > Working with masked values: > > I recall some strong points back in the days for not using None to > represent missing values... > > Adding a maskedstr argument to array2string ? Mmh... I prefer a global > flag like we have now. > > > > I'm not really a fan of all the global state that NumPy keeps, I guess > I'm trying to stamp that out bit by bit as well where I can... > > Pretty convenient to define a default once for all, though. > Maybe it needs to go in both places. > > Design questions: > > Adding `masked` or whatever we call it to a number/array should result is > masked/a fully masked array, period. That way, we can have an idea that > something was wrong with the initial dataset. > > > > I'm not sure I understand what you mean, in the design adding a mask > means setting "a.mask = True", "a.mask = False", or "a.mask = array>" in general. > > I mean that: > 0 + ma.masked = ma.masked > ma.array([1,2,3], mask=False) + ma.masked = ma.array([1,2,3], > mask=[True,True,True]) > > By extension, any operation involving a masked value should result in a > masked value. > R appears to consistently follow the model Nathaniel pointed out, and adopting the same one seems like a good idea to me. With the model in place, the desired result of these operations follows fairly naturally. > > hardmask: I never used the feature myself. I wonder if anyone did. Still, > it's a nice idea... > > > > Ok, I'll leave that out of the initial design unless someone comes up > with some strong use cases. > > Oh, it doesn't eat bread (as we say in French), so you can leave it where > it is... > Yeah, numpy.ma isn't going to disappear in a puff of smoke. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Fri Jun 24 12:25:06 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 11:25:06 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04B7BF.5030300@noaa.gov> References: <4E04B7BF.5030300@noaa.gov> Message-ID: On Fri, Jun 24, 2011 at 11:13, Christopher Barker wrote: > Nathaniel Smith wrote: > >> If we think that the memory overhead for floating point types is too >> high, it would be easy to add a special case where maybe(float) used a >> distinguished NaN instead of a separate boolean. > > That would ?be pretty cool, though in the past folks have made a good > argument that even for floats, masks have significant advantages over > "just using NaN". One might be that you can mask and unmask a value for > different operations, without losing the value. No one is suggesting that the NA approach is universal or replaces masked arrays. It's better at some things and worse at others. For many things, usually tables of statistical data, "unmasking" makes no sense; the missing data is forever missing. For others, particularly cases where you have gridded data, unmasking and remasking can be quite useful. They are complementary tools. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? 
-- Umberto Eco From Chris.Barker at noaa.gov Fri Jun 24 12:25:53 2011 From: Chris.Barker at noaa.gov (Christopher Barker) Date: Fri, 24 Jun 2011 09:25:53 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: <4E04BA91.7080809@noaa.gov> Robert Kern wrote: > It's worth noting that this is not a replacement for masked arrays, > nor is it intended to be the be-all, end-all solution to missing data > problems. It's mostly just intended to be a focused tool to fill in > the gaps where masked arrays are less convenient for whatever reason; I think we have enough problems with ndarray vs numpy.ma -- this is introducing a third option? IMHO, and from the discussion, it seems this proposal should be a "uniter, not a divider". While masked arrays may still exist, it would be nice if they were an extension of the new built-in thingie, not yet another implementation. > Not every dtype > would have an NA-aware counterpart. One of the great things about numpy is the full range of data types. I think it would be surprising and frustrating not to have masked versions of them all. By the way, what might be the performance hit of a "new" dtype -- wouldn't we lose all sorts of opportunities for the compiler and hardware to optimize? I can only imagine an "if" statement with every single computation. But maybe that isn't any more of a hit than a separate mask. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From njs at pobox.com Fri Jun 24 12:28:33 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 09:28:33 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern wrote: > The alternative proposal would be to add a few new dtypes that are > NA-aware. E.g. an nafloat64 would reserve a particular NaN value > (there are lots of different NaN bit patterns, we'd just reserve one) > that would represent NA. An naint32 would probably reserve the most > negative int32 value (like R does). Using the NA-aware dtypes signals > that you are using NA values; there is no need for an additional flag. For floats, this is easy, because NaN's are already built in. For integers, I worry a bit, because we'd have to break the usual two's complement arithmetic. int32 is closed under addition/multiplication/bitops. But for naint32, what's INT_MAX + 1? (In R, the answer is that *all* integer overflows are tested for and become NA, whether they would happen to land on INT_MIN or not, and AFAICT there are no bitops for integers.) For strings in the numpy context, just adding another byte to hold the NA-ness flag seems more sensible than stealing some random string. In both cases, the more generic maybe() dtype I suggested might be cleaner.
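The INT_MAX + 1 worry is concrete: with INT_MIN reserved as NA, ordinary two's-complement wraparound lands exactly on the proposed sentinel. A small sketch with today's int32:

    import numpy as np

    INT32_NA = np.iinfo(np.int32).min               # the proposed integer NA
    x = np.array([np.iinfo(np.int32).max], dtype=np.int32)
    y = x + np.array([1], dtype=np.int32)           # wraps around silently
    print(y[0] == INT32_NA)                         # True: an overflow now reads as "missing"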
-- Nathaniel From mwwiebe at gmail.com Fri Jun 24 12:33:35 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 11:33:35 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith wrote: > On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe wrote: > > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith wrote: > >> It's should also be possible to accomplish a general solution at the > >> dtype level. We could have a 'dtype factory' used like: > >> np.zeros(10, dtype=np.maybe(float)) > >> where np.maybe(x) returns a new dtype whose storage size is x.itemsize > >> + 1, where the extra byte is used to store missingness information. > >> (There might be some annoying alignment issues to deal with.) Then for > >> each ufunc we define a handler for the maybe dtype (or add a > >> special-case to the ufunc dispatch machinery) that checks the > >> missingness value and then dispatches to the ordinary ufunc handler > >> for the wrapped dtype. > > > > The 'dtype factory' idea builds on the way I've structured datetime as a > > parameterized type, but the thing that kills it for me is the alignment > > problems of 'x.itemsize + 1'. Having the mask in a separate memory block > is > > a lot better than having to store 16 bytes for an 8-byte int to preserve > the > > alignment. > > Hmm. I'm not convinced that this is the best approach either, but let > me play devil's advocate. > > The disadvantage of this approach is that masked arrays would > effectively have a 100% memory overhead over regular arrays, as > opposed to the "shadow mask" approach where the memory overhead is > 12.5%--100% depending on the size of objects being stored. Probably > the most common case is arrays of doubles, in which case it's 100% > versus 12.5%. So that sucks. > > But on the other hand, we gain: > -- simpler implementation: no need to be checking and tracking the > mask buffer everywhere. The needed infrastructure is already built in. > I don't believe this is true. The dtype mechanism would need a lot of work to build that needed infrastructure first. The analysis I've done so far indicates the masked approach will give a simpler/cleaner implementation. > -- simpler conceptually: we already have the dtype concept, it's a > very powerful and we use it for all sorts of things; using it here too > plays to our strengths. We already know what a numpy scalar is and how > it works. Everyone already understands how assigning a value to an > element of an array works, how it interacts with broadcasting, etc., > etc., and in this model, that's all a missing value is -- just another > value. > >From Python, this aspect of things would be virtually identical between the two mechanisms. The dtype approach would require more coding and overhead where you have to create copies of your data to convert it into the parameterized "NA[int32]" dtype, versus with the masked approach where you say x.flags.hasmask = True or something like that without copying the data. > -- it composes better with existing functionality: for example, > someone mentioned the distinction between a missing field inside a > record versus a missing record. In this model, that would just be the > difference between dtype([("x", maybe(float))]) and maybe(dtype([("x", > float)])). > Indeed, the difference between an "NA[:x:f4, :y:f4]" versus ":x:NA[f4], :y:NA[f4]" can't be expressed the way I've designed the mask functionality. 
(Note, this struct dtype string representation isn't actually supported in NumPy.) Optimization is important and all, but so is simplicity and > robustness. That's why we're using Python in the first place :-). > > If we think that the memory overhead for floating point types is too > high, it would be easy to add a special case where maybe(float) used a > distinguished NaN instead of a separate boolean. The extra complexity > would be isolated to the 'maybe' dtype's inner loop functions, and > transparent to the Python level. (Implementing a similar optimization > for the masking approach would be really nasty.) This would change the > overhead comparison to 0% versus 12.5% in favor of the dtype approach. > Yeah, there would be no such optimization for the masked approach. If someone really wants this, they are not precluded from also implementing their own "nafloat" dtype which operates independently of the masking mechanism. -Mark > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 12:45:04 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 11:45:04 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 6:59 AM, Matthew Brett wrote: > Hi, > > On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith wrote: > ... > > If we think that the memory overhead for floating point types is too > > high, it would be easy to add a special case where maybe(float) used a > > distinguished NaN instead of a separate boolean. The extra complexity > > would be isolated to the 'maybe' dtype's inner loop functions, and > > transparent to the Python level. (Implementing a similar optimization > > for the masking approach would be really nasty.) This would change the > > overhead comparison to 0% versus 12.5% in favor of the dtype approach. > > Can I take this chance to ask Mark a bit more about the problems he > sees for the dtypes with missing values? That is have a > > np.float64_with_missing > np.int32_with_missing > > type dtypes. I see in your NEP you say 'The trouble with this > approach is that it requires a large amount of special case code in > each data type, and writing a new data type supporting missing data > requires defining a mechanism for a special signal value which may not > be possible in general.' > > Just to be clear, you are saying that that, for each dtype, there > needs to be some code doing: > > missing_value = dtype.missing_value > > then, in loops: > > if val[here] == missing_value: > do_something() > > and the fact that 'missing_value' could be any type would make the > code more complicated than the current case where the mask is always > bools or something? > I'm referring to the underlying C implementations of the dtypes and any additional custom dtypes that people create. With the masked approach, you implement a new custom data type in C, and it automatically works with missing data. With the custom dtype approach, you have to do a lot more error-prone work to handle the special values in all the ufuncs. > > Nathaniel's point about reduction in storage needed for the mask to 0 > is surely significant if we want numpy to be the best choice for big > data. 
> The mask will only be there if it's explicitly requested, so it's not taking away from NumPy in any way. If someone is dealing with data that large, it likely wouldn't always fit with the particular NA conventions NumPy chooses for the various primitive data types, so that approach isn't a clear win either. You mention that it would be good to allow masking for any new dtype - > is that a practical problem? I mean, how many people will in fact > have the combination of a) need of masking b) need of custom dtype, > and c) lack of time or expertise to implement masking for that type? > Well, the people who need that right now will probably look at the NumPy C source code and give up immediately. I'd rather push the system in a direction of it being easier for those people than harder. It should be possible to define a C++ data type class with overloaded operators, then say NPY_EXPOSE_DTYPE(MyCustomClass), which would wrap those overloaded operators with NumPy conventions. If this were done, I suspect many people would create custom data types. -Mark > > Thanks a lot for the proposal and the discussion, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 12:48:56 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 11:48:56 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04834F.6060708@gmail.com> References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 7:30 AM, Laurent Gautier wrote: > On 2011-06-24 13:59, Nathaniel Smith wrote: > > On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: > >> Lastly, I am not entirely familiar with R, so I am also very curious about > >> what this magical "NA" value is, and how it compares to how NaNs work. > >> Although, Pierre brought up the very good point that NaNs wouldn't work > >> anyway with integer arrays (and object arrays, etc.). > > Since R is designed for statistics, they made the interesting decision > > that *all* of their core types have a special designated "missing" > > value. At the R level this is just called "NA". Internally, there are > > a bunch of different NA values -- for floats it's a particular NaN, > > for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never > > notice this, because R will silently cast a NA of one type into NA of > > another type whenever needed, and they all print the same.) > > > > Because any array can contain NA's, all R functions then have to have > > some way of handling this -- all their integer arithmetic knows that > > INT_MIN is special, for instance. The rules are basically the same as > > for NaN's, but NA and NaN are different from each other (because one > > means "I don't know, could be anything" and the other means "you tried > > to divide by 0, I *know* that's meaningless"). > > > > That's basically it. > > > > -- Nathaniel > > Would the use of R's system for expressing "missing values" be possible > in numpy through a special flag ? > I think that's something R2Py would have to handle in its compatibility layer. I'd like to first make the system within NumPy work well for NumPy; interoperability at the low ABI level like this is a bit too restricting, I think.
-Mark Any given numpy array could have a boolean flag (say "na_aware") > indicating that some of the values are representing a missing cell. > > If the exact same system is used, interaction with R (through something > like rpy2) would be simplified and more robust. > > L. > > PS: In R, dividing one by zero returns +/-Inf, not NaN. 0/0 returns NaN. > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 12:50:30 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 11:50:30 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 8:01 AM, Neal Becker wrote: > Just 1 question before I look more closely. What is the cost to the non-MA > user > of this addition? > I'm following the idea that you don't pay for what you don't use. All the existing stuff will perform the same. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jun 24 12:54:23 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 09:54:23 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 9:33 AM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith wrote: >> But on the other hand, we gain: >> ?-- simpler implementation: no need to be checking and tracking the >> mask buffer everywhere. The needed infrastructure is already built in. > > I don't believe this is true. The dtype mechanism would need a lot of work > to build that needed infrastructure first. The analysis I've done so far > indicates the masked approach will give a simpler/cleaner implementation. Really? One implementation option would be the one I described, where it just uses the standard dtype extension machinery. AFAICT the only core change that would be needed to allow that is for us to add another argument to the PyUFuncGenericFunction signature, which contains the dtype(s) of the array(s) being operated on. The other option would be to add a special case to the ufunc looping code, so that if we have a 'maybe' dtype it would check the NA flag itself before calling down to the actual ufunc's. (This is more intrusive, but not any more intrusive than the masking approach.) Neither approach seems to require any infrastructure changes as a prerequisite to me, so probably I'm missing something. What problems are you thinking of? >> ?-- simpler conceptually: we already have the dtype concept, it's a >> very powerful and we use it for all sorts of things; using it here too >> plays to our strengths. We already know what a numpy scalar is and how >> it works. Everyone already understands how assigning a value to an >> element of an array works, how it interacts with broadcasting, etc., >> etc., and in this model, that's all a missing value is -- just another >> value. > > From Python, this aspect of things would be virtually identical between the > two mechanisms. 
The dtype approach would require more coding and overhead > where you have to create copies of your data to convert it into the > parameterized "NA[int32]" dtype, versus with the masked approach where you > say x.flags.hasmask = True?or something like that without copying the data. Yes, converting from a regular dtype to an NA-ful dtype would require a copy, but that's true for any dtype conversion. The solution is the same as always -- just use the correct dtype from the start. Is there some reason why people would often start with the 'wrong' dtype and then need to convert? -- Nathaniel From mwwiebe at gmail.com Fri Jun 24 12:55:42 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 11:55:42 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 8:57 AM, Keith Goodman wrote: > On Thu, Jun 23, 2011 at 3:24 PM, Mark Wiebe wrote: > > On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman > wrote: > >> > >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe wrote: > >> > Enthought has asked me to look into the "missing data" problem and how > >> > NumPy > >> > could treat it better. I've considered the different ideas of adding > >> > dtype > >> > variants with a special signal value and masked arrays, and concluded > >> > that > >> > adding masks to the core ndarray appears is the best way to deal with > >> > the > >> > problem in general. > >> > I've written a NEP that proposes a particular design, viewable here: > >> > > >> > > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst > >> > There are some questions at the bottom of the NEP which definitely > need > >> > discussion to find the best design choices. Please read, and let me > know > >> > of > >> > all the errors and gaps you find in the document. > >> > >> Wow, that is exciting. > >> > >> I wonder about the relative performance of the two possible > >> implementations (mask and NA) in the PEP. > > > > I've given that some thought, and I don't think there's a clear way to > tell > > what the performance gap would be without implementations of both to > > benchmark against each other. I favor the mask primarily because it > provides > > masking for all data types in one go with a single consistent interface > to > > program against. For adding NA signal values, each new data type would > need > > a lot of work to gain the same level of support. > >> > >> If you are, say, doing a calculation along the columns of a 2d array > >> one element at a time, then you will need to grab an element from the > >> array and grab the corresponding element from the mask. I assume the > >> corresponding data and mask elements are not stored together. That > >> would be slow since memory access is usually were time is spent. In > >> this regard NA would be faster. > > > > Yes, the masks add more memory traffic and some extra calculation, while > the > > NA signal values just require some additional calculations. > > I guess a better example would have been summing along rows instead of > columns of a large C order array. If one needs to look at both the > data and the mask then wouldn't summing along rows in cython be about > as slow as it is currently to sum along columns? > Not quite, both the mask and the array data are being traversed coherently, so it isn't jumping around in memory like in the columns case you're describing. > >> I currently use NaN as a missing data marker. 
That adds things like > >> this to my cython code: > >> > >> if a[i] == a[i]: > >> asum += a[i] > >> > >> If NA also had the property NA == NA is False, then it would be easy > >> to use. > > > > That's what I believe it should do, and I guess this is a strike against > the > > idea of returning None for a single missing value. > > If NA == NA is False then I wouldn't need to look at the mask in the > example above. Or would ndarray have to look at the mask in order to > return NA for a[i]? Which would mean __getitem__ would need to look at > the mask? > What R does is return NA for NA == NA. Then, if you try to use it as a boolean, it throws an exception. I like this approach. If the missing value is returned as a 0d array (so that NA == NA is > False), would that break cython in a fundamental way since it could > not always return a same-sized scalar when you index into an array? > I don't know enough about Cython internals to comment, sorry. -Mark >> A mask, on the other hand, would be more difficult for third > >> party packages to support. You have to check if the mask is present > >> and if so do a mask-aware calculation; if is it not present then you > >> have to do a non-mask based calculation. > > > > I actually see the mask as being easier for third party packages to > support, > > particularly from C. Having regular C-friendly values with a boolean mask > is > > a lot friendlier than values that require a lot of special casing like > the > > NA signal values would require. > > > >> > >> So you have two code paths. > >> You also need to check if any of the input arrays have masks and if so > >> apply masks to the other inputs, etc. > > > > Most of the time, the masks will transparently propagate or not along > with > > the arrays, with no effort required. In Python, the code you write would > be > > virtually the same between the two approaches. > > -Mark > > > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srean.list at gmail.com Fri Jun 24 12:58:27 2011 From: srean.list at gmail.com (srean) Date: Fri, 24 Jun 2011 11:58:27 -0500 Subject: [Numpy-discussion] How to avoid extra copying when forming an array from an iterator In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 9:12 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 04:03, srean wrote: > > To answer my own question, I guess I can keep appending to a > array.array() > > object and get a numpy.array from its buffer if possible. Is that the > > efficient way. > > It's one of the most efficient ways to do it, yes, especially for 1D > arrays. > Thanks for the reply. My first cut was to try cStringIO because I thought I could use the writelines() method and hence avoid the for loop in the python code. It would have been nice if that had worked. If I understood it correctly the looping would have been in a C function call. regards srean -------------- next part -------------- An HTML attachment was scrubbed... 
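A minimal sketch of the iterator-to-array pattern discussed above, using np.fromiter, which accepts a generator directly; the sample records, the helper name, and the choice to parse the last four characters as decimal digits (rather than reinterpreting them as raw int32 bytes) are assumptions for illustration only:

import numpy as np

def last4_as_int(records):
    # Assumed record format: fixed-width strings whose last 4 characters
    # hold a decimal number; adjust the parsing to the real data layout.
    for rec in records:
        yield int(rec[-4:])

records = ["abcd0042", "efgh0007", "ijkl0123"]   # made-up sample data
a = np.fromiter(last4_as_int(records), dtype=np.int64)
# a -> array([ 42,   7, 123]); the array is filled directly from the
# generator without building an intermediate Python list.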
URL: From wesmckinn at gmail.com Fri Jun 24 13:06:57 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 24 Jun 2011 13:06:57 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 12:33 PM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith wrote: >> >> On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe wrote: >> > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith wrote: >> >> It's should also be possible to accomplish a general solution at the >> >> dtype level. We could have a 'dtype factory' used like: >> >> ?np.zeros(10, dtype=np.maybe(float)) >> >> where np.maybe(x) returns a new dtype whose storage size is x.itemsize >> >> + 1, where the extra byte is used to store missingness information. >> >> (There might be some annoying alignment issues to deal with.) Then for >> >> each ufunc we define a handler for the maybe dtype (or add a >> >> special-case to the ufunc dispatch machinery) that checks the >> >> missingness value and then dispatches to the ordinary ufunc handler >> >> for the wrapped dtype. >> > >> > The 'dtype factory' idea builds on the way I've structured datetime as a >> > parameterized type, but the thing that kills it for me is the alignment >> > problems of 'x.itemsize + 1'. Having the mask in a separate memory block >> > is >> > a lot better than having to store 16 bytes for an 8-byte int to preserve >> > the >> > alignment. >> >> Hmm. I'm not convinced that this is the best approach either, but let >> me play devil's advocate. >> >> The disadvantage of this approach is that masked arrays would >> effectively have a 100% memory overhead over regular arrays, as >> opposed to the "shadow mask" approach where the memory overhead is >> 12.5%--100% depending on the size of objects being stored. Probably >> the most common case is arrays of doubles, in which case it's 100% >> versus 12.5%. So that sucks. >> >> But on the other hand, we gain: >> ?-- simpler implementation: no need to be checking and tracking the >> mask buffer everywhere. The needed infrastructure is already built in. > > I don't believe this is true. The dtype mechanism would need a lot of work > to build that needed infrastructure first. The analysis I've done so far > indicates the masked approach will give a simpler/cleaner implementation. > >> >> ?-- simpler conceptually: we already have the dtype concept, it's a >> very powerful and we use it for all sorts of things; using it here too >> plays to our strengths. We already know what a numpy scalar is and how >> it works. Everyone already understands how assigning a value to an >> element of an array works, how it interacts with broadcasting, etc., >> etc., and in this model, that's all a missing value is -- just another >> value. > > From Python, this aspect of things would be virtually identical between the > two mechanisms. The dtype approach would require more coding and overhead > where you have to create copies of your data to convert it into the > parameterized "NA[int32]" dtype, versus with the masked approach where you > say x.flags.hasmask = True?or something like that without copying the data. > >> >> ?-- it composes better with existing functionality: for example, >> someone mentioned the distinction between a missing field inside a >> record versus a missing record. In this model, that would just be the >> difference between dtype([("x", maybe(float))]) and maybe(dtype([("x", >> float)])). 
> > Indeed, the difference between an "NA[:x:f4, :y:f4]" versus ":x:NA[f4], > > :y:NA[f4]" can't be expressed the way I've designed the mask functionality. > > (Note, this struct dtype string representation isn't actually supported in > > NumPy.) > >> > >> Optimization is important and all, but so is simplicity and > >> robustness. That's why we're using Python in the first place :-). > >> > >> If we think that the memory overhead for floating point types is too > >> high, it would be easy to add a special case where maybe(float) used a > >> distinguished NaN instead of a separate boolean. The extra complexity > >> would be isolated to the 'maybe' dtype's inner loop functions, and > >> transparent to the Python level. (Implementing a similar optimization > >> for the masking approach would be really nasty.) This would change the > >> overhead comparison to 0% versus 12.5% in favor of the dtype approach. > > > > Yeah, there would be no such optimization for the masked approach. If > > someone really wants this, they are not precluded from also implementing > > their own "nafloat" dtype which operates independently of the masking > > mechanism. > > -Mark > > > >> > >> -- Nathaniel > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > I don't have enough time to engage in this discussion as I'd like but I'll give my input. I've spent a very large amount of time in pandas trying to craft a sensible and performant missing-data-handling solution given existing tools which does not get in the user's way and which also works for non-floating point data. The result works but isn't completely 100% satisfactory, and I went by the Zen of Python in that "practicality beats purity". If anyone's interested, have a pore through the pandas unit tests for lots of exceptional cases and examples. About 38 months ago when I started writing the library now called pandas I examined numpy.ma and friends and decided that a) the performance overhead for floating point data was not acceptable b) numpy.ma does too much for the needs of financial applications, say, or in mimicking R's NA functionality (part of why perf suffers) c) masked arrays are difficult (imho) for non-expert users to use effectively. In my experience, it gets in your way, and subclassing is largely to blame for this (along with the mask field and the myriad mask-related functions). It's very "complete" from a technical purity / computer science-y standpoint but practicality is traded off (re: Zen of Python). In R many functions have a flag to handle NA's like na.rm and there is the is.na function, along with a few other NA-handling functions, and that's it. I wasn't willing to teach my colleagues / users of pandas how to use masked arrays (or scikits.timeseries, because of numpy.ma reliance) for this reason. I believe that this has overall been the right decision. So whatever solution you come up with, you need to dogfood it if possible with users who are only at a beginning-to-intermediate level of NumPy or Python expertise. Does it get in the way? Does it require constant tinkering with masks (if there is a boolean mask versus a special NA value)? Intuitive?
Hopefully I can take whatever result comes of this development effort and change pandas to be implemented on top of it without changing the existing API / behavior in any significant ways. If I cannot, I will be (very, very) sad. (I don't mean to be overly critical of numpy.ma-- I just care about solving problems and making the tools as easy-to-use and intuitive as possible.) - Wes From alex.flint at gmail.com Fri Jun 24 13:39:05 2011 From: alex.flint at gmail.com (Alex Flint) Date: Fri, 24 Jun 2011 13:39:05 -0400 Subject: [Numpy-discussion] correlation in fourier domain Message-ID: I would like to perform 2d correlations between a large matrix A and a bunch of small matrices B1...Bn. Is there anything built into numpy to help me do this efficiently in the fourier domain (i.e. DFT of A -> multiplication with DFT of B1..Bn)? Cheers, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 13:43:42 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 12:43:42 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E049EEB.1060203@gmail.com> References: <4E04834F.6060708@gmail.com> <4E049EEB.1060203@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 9:27 AM, Bruce Southey wrote: > ** > On 06/24/2011 09:06 AM, Robert Kern wrote: > > On Fri, Jun 24, 2011 at 07:30, Laurent Gautier wrote: > > On 2011-06-24 13:59, Nathaniel Smith wrote: > > On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote: > > Lastly, I am not entirely familiar with R, so I am also very curious about > what this magical "NA" value is, and how it compares to how NaNs work. > Although, Pierre brought up the very good point that NaNs woulldn't work > anyway with integer arrays (and object arrays, etc.). > > Since R is designed for statistics, they made the interesting decision > that *all* of their core types have a special designated "missing" > value. At the R level this is just called "NA". Internally, there are > a bunch of different NA values -- for floats it's a particular NaN, > for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never > notice this, because R will silently cast a NA of one type into NA of > another type whenever needed, and they all print the same.) > > Because any array can contain NA's, all R functions then have to have > some way of handling this -- all their integer arithmetic knows that > INT_MIN is special, for instance. The rules are basically the same as > for NaN's, but NA and NaN are different from each other (because one > means "I don't know, could be anything" and the other means "you tried > to divide by 0, I *know* that's meaningless"). > > That's basically it. > > -- Nathaniel > > Would the use of R's system for expressing "missing values" be possible > in numpy through a special flag ? > > Any given numpy array could have a boolean flag (say "na_aware") > indicating that some of the values are representing a missing cell. > > If the exact same system is used, interaction with R (through something > like rpy2) would be simplified and more robust. > > The alternative proposal would be to add a few new dtypes that are > NA-aware. E.g. an nafloat64 would reserve a particular NaN value > (there are lots of different NaN bit patterns, we'd just reserve one) > that would represent NA. An naint32 would probably reserve the most > negative int32 value (like R does). 
Using the NA-aware dtypes signals > that you are using NA values; there is no need for an additional flag. > > > > There is a very important distinction here between a masked value and a > missing value. In some sense, a missing value is a permanently masked value > and may be indistinguishable from 'not a number'. But a masked value is not > a missing value because with masked arrays the original values still exist. > Consequently using a masked value for 'missing-ness' can be reversed by the > user changing the mask at any time. That is really the power of masked > arrays as you can have real missing values but also 'flag' unusual values as > missing! > In the design I'm proposing, it's using a mask to implement missing values, hence the usage of the terms "masked" and "unmasked" elements. The semantics you're describing can be achieved with the missing value interpretation. First, you take a view of your array, then give it a mask. In the view, there will be strict missing data semantics, but the data is still accessible through the original array. When people come to NumPy asking about missing values, they generally get pointed at numpy.ma, so there is an impression out there that it's intended for that usage. Virtually all software packages handle missing values, not masked values. > So it is really, really important that you clarify what you are proposing > because your proposal does mix these two different concepts. > > As per the missing value discussion, I would think that adding a missing > value data type(s) for 'missing values' would be feasible and may be > something that numpy should have. But that would not address 'masked values' > and probably must be viewed as an independent topic and thread. > Can you describe what needed features are missing from taking a view + adding a mask, that 'masked values' as a separate concept would have? While different, I think a single implementation with strict semantics can provide both perspectives when used like this or in a similar fashion. > Below are some sources for missing values in R and SAS. SAS has 28 ways > that a user can define numerical values as 'missing values' - not just the > dot! While not apparently universal, SAS has missing value codes to handle > positive and negative infinity. R does differentiate between missing values and > 'not a number' which, to my knowledge, SAS does not do. > The question this raises in my mind is whether an "NA"-like object in NumPy should have a type associated with it, or whether it should be a singleton. Like np.NA('i8') for a missing 64-bit int, or np.NA like None but with the specific missing value semantics. Keeping the type around would allow for checking against casting rules. > This distinction is probably important for masked vs missing values. SAS > uses a blank for missing character values but see the two links at the end > for more on that. > > This is for R: > http://faculty.nps.edu/sebuttre/home/S/missings.html > http://www.ats.ucla.edu/stat/r/faq/missing.htm > > > http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001292604.htm > This page is a comparison to R: > > http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer.htm#imlug_r_sect019.htm > > Some other SAS sources: > Malachy J. Foley "MISSING VALUES: Everything You Ever Wanted to Know" > http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF > 072-2011: Special Missing Values for Character Fields - SAS > support.sas.com/resources/papers/proceedings11/072-2011.pdf > Thanks for the links!
-Mark > > > > Bruce > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 13:55:51 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 12:55:51 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> References: <4E04834F.6060708@gmail.com> <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:02 AM, Pierre GM wrote: > On Jun 24, 2011, at 4:44 PM, Robert Kern wrote: > > > On Fri, Jun 24, 2011 at 09:35, Robert Kern > wrote: > >> On Fri, Jun 24, 2011 at 09:24, Keith Goodman > wrote: > >>> On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern > wrote: > >>> > >>>> The alternative proposal would be to add a few new dtypes that are > >>>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value > >>>> (there are lots of different NaN bit patterns, we'd just reserve one) > >>>> that would represent NA. An naint32 would probably reserve the most > >>>> negative int32 value (like R does). Using the NA-aware dtypes signals > >>>> that you are using NA values; there is no need for an additional flag. > >>> > >>> I don't understand the numpy design and maintainable issues, but from > >>> a user perspective (mine) nafloat64, etc sounds nice. > >> > >> It's worth noting that this is not a replacement for masked arrays, > >> nor is it intended to be the be-all, end-all solution to missing data > >> problems. It's mostly just intended to be a focused tool to fill in > >> the gaps where masked arrays are less convenient for whatever reason; > >> e.g. where you're tempted to (ab)use NaNs for the purpose and the > >> limitations on the range of values is acceptable. Not every dtype > >> would have an NA-aware counterpart. I would suggest just nabool, > >> nafloat64, naint32, nastring (a little tricky due to the flexible > >> size, but doable), and naobject. Maybe a couple more, if we get > >> requests, like naint64 and nacomplex128. > > > > Oh, and nadatetime64 and natimedelta64. > > So, if I understand correctly: > if my array has a nafloat type, it's an array that supports missing values > and it will always have a mask, right ? And just viewing an array as a > nafloat dtyped one would make it an 'array-with-missing-values' ? That's > pretty elegant. I like that. > My understanding is a little bit different: The na* discussion is about implementing a full or partial set of "shadow types" which are like their regular types, but have a signal value indicating they are "NA". There's another idea, to create a parameterized type mechanism with types like "NA[int32]", adding a missing-value flag to the int32 and growing its size by up to the dtype's alignment. Using the mask to implement the missing value semantics at the array level instead of the dtype level is my proposal, neither of the others involve separate masks. > Now, how will masked values represented ? Different masked values from one > dtype to another ? What would be the equivalent of something like `if a[0] > is masked` that we have know? > If there's a global np.NA singleton, `if a[0] is np.NA` would work equivalently. That's a strike against storing the dtype with the NA object. 
-Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 13:57:14 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 12:57:14 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:07 AM, Matthew Brett wrote: > Hi, > > On Fri, Jun 24, 2011 at 3:43 PM, Robert Kern wrote: > > On Fri, Jun 24, 2011 at 09:33, Charles R Harris > > wrote: > >> > >> On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern wrote: > > > >>> The alternative proposal would be to add a few new dtypes that are > >>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value > >>> (there are lots of different NaN bit patterns, we'd just reserve one) > >>> that would represent NA. An naint32 would probably reserve the most > >>> negative int32 value (like R does). Using the NA-aware dtypes signals > >>> that you are using NA values; there is no need for an additional flag. > >> > >> Definitely better names than r-int32. Going this way has the advantage of > >> reducing the friction between R and numpy, and since R has pretty much > >> become the standard software for statistics that is an important > >> consideration. > > > > I would definitely steal their choices of NA value for naint32 and > > nafloat64. I have reservations about their string NA value (i.e. 'NA') > > as anyone doing business in North America and other continents may > > have issues with that.... > > It would certainly help me at least if someone (Mark? sorry to > ask...) could set out the implementation and API differences that > would result from the two options: > > 1) array.mask option - an integer array of shape array.shape giving > mask (True, False) values for each element > 2) nafloat64 option - dtypes with specified dtype-specific missing values > That's something that should go in the NEP, I'll email when I update it. -Mark > > Best, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Fri Jun 24 14:04:11 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 24 Jun 2011 19:04:11 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> Message-ID: Hi, Just as a use case, if I do this: a = np.zeros((big_number,), dtype=np.int32) a[0] = np.NA I think I'm right in saying that, with the array.mask implementation my array memory usage will grow by big_number new bytes, whereas with the np.naint32 implementation you'd get something like: Error('This data type does not allow missing values') Is that right?
See y'all, Matthew From mwwiebe at gmail.com Fri Jun 24 14:06:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 13:06:44 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04B7BF.5030300@noaa.gov> References: <4E04B7BF.5030300@noaa.gov> Message-ID: On Fri, Jun 24, 2011 at 11:13 AM, Christopher Barker wrote: > Nathaniel Smith wrote: > >> The 'dtype factory' idea builds on the way I've structured datetime as a > >> parameterized type, > > ... > > Another disadvantage is that we get further from Gael Varoquaux's point: > >> Right now, the numpy array can be seen as an extension of the C > >> array, basically a pointer, a data type, and a shape (and strides). > >> This enables easy sharing with libraries that have not been > >> written with numpy in mind. > > and also PEP 3118 support > > It is very useful that a numpy array has a pointer to a regular old C > array -- if we introduce this special dtype, that will break (well, not > really, put the the c array would be of this particular struct). > Granted, any other C code would properly have to do something with the > mask anyway, but I still think it'd be better to keep that raw data > array standard. > It's not actually a pointer to a C array, there is already a lot of checking and possibly a copy/buffer required before you can treat it as such. The data may be misaligned, have noncontiguous strides, have a non-C multidimensional memory layout, or have a different byte order. Dealing with all these special cases in a uniform way is one of the things the 1.6 nditer provides a lot of helps for. > > This applies to switching between masked and not-masked numpy arrays > also -- I don't think I'd want the performance hot of that requiring a > data copy. > When performance is important, it is still possible to avoid that copy - by adding the mask to a view of the original array. The mask= parameter to ufuncs, something which is independent of arrays with masks, also provides a way to do masked operations without ever touching masked arrays. Also the idea was posted here that you could use views to have the same > data set with different masks -- that would break as well. > I'm not sure how this would break? I think that should work just fine. > > Nathaniel Smith wrote: > > > If we think that the memory overhead for floating point types is too > > high, it would be easy to add a special case where maybe(float) used a > > distinguished NaN instead of a separate boolean. > > That would be pretty cool, though in the past folks have made a good > argument that even for floats, masks have significant advantages over > "just using NaN". One might be that you can mask and unmask a value for > different operations, without losing the value. > Especially with the ability to do the "hardmask" feature, this aspect of it might end up being useful. -Mark > > -Chris > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
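To make the 'same data, different masks, no copy' point above concrete with something that exists today, here is a small numpy.ma sketch; it is only a stand-in for the proposed core-ndarray masks, and the sample values are invented:

import numpy as np
import numpy.ma as ma

data = np.arange(6, dtype=np.float64)

# Two masked views sharing the one data buffer, each with its own mask
# (copy=False, the default, so the elements themselves are not duplicated).
lo = ma.array(data, mask=[0, 0, 0, 1, 1, 1], copy=False)
hi = ma.array(data, mask=[1, 1, 1, 0, 0, 0], copy=False)

data[0] = 100.0        # visible through both views, since nothing was copied
print(lo.sum())        # 100 + 1 + 2 = 103.0, masked elements are skipped
print(hi.sum())        # 3 + 4 + 5 = 12.0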
URL: From mwwiebe at gmail.com Fri Jun 24 14:09:50 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 13:09:50 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04B7BF.5030300@noaa.gov> Message-ID: On Fri, Jun 24, 2011 at 11:25 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 11:13, Christopher Barker > wrote: > > Nathaniel Smith wrote: > > > >> If we think that the memory overhead for floating point types is too > >> high, it would be easy to add a special case where maybe(float) used a > >> distinguished NaN instead of a separate boolean. > > > > That would be pretty cool, though in the past folks have made a good > > argument that even for floats, masks have significant advantages over > > "just using NaN". One might be that you can mask and unmask a value for > > different operations, without losing the value. > > No one is suggesting that the NA approach is universal or replaces > masked arrays. It's better at some things and worse at others. For > many things, usually tables of statistical data, "unmasking" makes no > sense; the missing data is forever missing. For others, particularly > cases where you have gridded data, unmasking and remasking can be > quite useful. They are complementary tools. > It appears to me that views of arrays with masks as in my proposal can support this masking/unmasking semantics without breaking the missing value abstraction. It would be nice for these concepts to fit together well in a single system, anyhow. -Mark > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless > enigma that is made terrible by our own mad attempt to interpret it as > though it had an underlying truth." > -- Umberto Eco > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 14:14:12 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 13:14:12 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <4E04BA91.7080809@noaa.gov> References: <4E04834F.6060708@gmail.com> <4E04BA91.7080809@noaa.gov> Message-ID: On Fri, Jun 24, 2011 at 11:25 AM, Christopher Barker wrote: > Robert Kern wrote: > > > It's worth noting that this is not a replacement for masked arrays, > > nor is it intended to be the be-all, end-all solution to missing data > > problems. It's mostly just intended to be a focused tool to fill in > > the gaps where masked arrays are less convenient for whatever reason; > > I think we have enough problems with ndarray vs numpy.ma -- this is > introducing a third option? IMHO, and from the discussion, it seems this > proposal should be a "uniter, not a divider". > > While masked arrays may still exist, it would be nice if they were an > extension of the new built-in thingie, not yet another implementation. > If someone wants to have the current numpy.ma semantics with regards to NaN's becoming masked automatically, or some other behavior not in my proposal, I think a subclass with tweaks would be good, yes. > Not every dtype > > would have an NA-aware counterpart. > > One of the great things about numpy is the full range of data types. I > think it would be surprising and frustrating not to have masked versions > of them all. 
> > By the way, what might be the performance hit of a "new" dtype -- > wouldn't we lose all sort of opportunities for the compiler and hardware > to optimize? I can only image an "if" statement with every single > computation. But maybe that isn't any more of a hit that a separate mask. > There is also a potential reliability issue with the na* dtype approach. Every single operation of every dtype has to get the NA logic correct. While there may be ways to unify the code in various ways, ensuring robustness of that is more difficult than for a system which has a single implementation of the NA logic applied to masks. -Mark > > -Chris > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Fri Jun 24 14:18:07 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 24 Jun 2011 19:18:07 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: Hi, On Fri, Jun 24, 2011 at 5:45 PM, Mark Wiebe wrote: > On Fri, Jun 24, 2011 at 6:59 AM, Matthew Brett > wrote: >> >> Hi, >> >> On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith wrote: ... >> and the fact that 'missing_value' could be any type would make the >> code more complicated than the current case where the mask is always >> bools or something? > > I'm referring to the underlying C implementations of the dtypes and any > additional custom dtypes that people create. With the masked approach, you > implement a new custom data type in C, and it automatically works with > missing data. With the custom dtype approach, you have to do a lot more > error-prone work to handle the special values in all the ufuncs. This is just pure ignorance on my part, but I can see that the ufuncs need to handle the missing values, but I can't see immediately why that will be much more complicated than the 'every array might have a mask' implementation. This was what I was trying to say with my silly sketch: missing_value = np.dtype.missing_value for e in oned_array: if e == missing_value: well - you get the idea. Obviously this is what you've been thinking about, I just wanted to get a grasp of where the extra complexity is coming from compared to: for i, e in enumerate(one_d_array): if one_d_array.mask[i] == False: Cheers, Matthew From mwwiebe at gmail.com Fri Jun 24 15:26:05 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 14:26:05 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 11:54 AM, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 9:33 AM, Mark Wiebe wrote: > > On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith wrote: > >> But on the other hand, we gain: > >> -- simpler implementation: no need to be checking and tracking the > >> mask buffer everywhere. The needed infrastructure is already built in. > > > > I don't believe this is true. The dtype mechanism would need a lot of > work > > to build that needed infrastructure first. 
The analysis I've done so far > > indicates the masked approach will give a simpler/cleaner implementation. > > Really? One implementation option would be the one I described, where > it just uses the standard dtype extension machinery. AFAICT the only > core change that would be needed to allow that is for us to add > another argument to the PyUFuncGenericFunction signature, which > contains the dtype(s) of the array(s) being operated on. The other > option would be to add a special case to the ufunc looping code, so > that if we have a 'maybe' dtype it would check the NA flag itself > before calling down to the actual ufunc's. (This is more intrusive, > but not any more intrusive than the masking approach.) > Having extended the ufuncs in a direction towards supporting parameterized types (with just some baby steps) for the datetime64, the idea that dtype deviating from a straightforward number will work with "standard dtype extension machinery" doesn't seem right to me. For the maybe dtype, it would need to gain access to the ufunc loop of the underlying dtype, and call it appropriately during the inner loop. This appears to require some more invasive upheaval within the ufunc code than the masking approach. Neither approach seems to require any infrastructure changes as a > prerequisite to me, so probably I'm missing something. What problems > are you thinking of? > Adding nditer, re-enabling ABI compatibility for 1.6, making datetime64 work reasonably as a parameterized type, these are all things that required a fair bit of changes to NumPy's infrastructure. Making either of these missing value designs, which are doing something new that hasn't previously been done in NumPy at a C level, will definitely require something similar. >> -- simpler conceptually: we already have the dtype concept, it's a > >> very powerful and we use it for all sorts of things; using it here too > >> plays to our strengths. We already know what a numpy scalar is and how > >> it works. Everyone already understands how assigning a value to an > >> element of an array works, how it interacts with broadcasting, etc., > >> etc., and in this model, that's all a missing value is -- just another > >> value. > > > > From Python, this aspect of things would be virtually identical between > the > > two mechanisms. The dtype approach would require more coding and overhead > > where you have to create copies of your data to convert it into the > > parameterized "NA[int32]" dtype, versus with the masked approach where > you > > say x.flags.hasmask = True or something like that without copying the > data. > > Yes, converting from a regular dtype to an NA-ful dtype would require > a copy, but that's true for any dtype conversion. The solution is the > same as always -- just use the correct dtype from the start. Is there > some reason why people would often start with the 'wrong' dtype and > then need to convert? > Here's a possible scenario: Someone has a large binary data file, say 1GB or so, that they have memmaped to a NumPy array. Now they want to apply some criteria to determine which rows to keep and which to ignore. Being able to add a mask to this array and treat it with the missing data mechanism seems like it would be very attractive here. -Mark > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Fri Jun 24 15:36:02 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 14:36:02 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 12:06 PM, Wes McKinney wrote: > On Fri, Jun 24, 2011 at 12:33 PM, Mark Wiebe wrote: > > On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith wrote: > >> > >> On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe wrote: > >> > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith > wrote: > >> >> It's should also be possible to accomplish a general solution at the > >> >> dtype level. We could have a 'dtype factory' used like: > >> >> np.zeros(10, dtype=np.maybe(float)) > >> >> where np.maybe(x) returns a new dtype whose storage size is > x.itemsize > >> >> + 1, where the extra byte is used to store missingness information. > >> >> (There might be some annoying alignment issues to deal with.) Then > for > >> >> each ufunc we define a handler for the maybe dtype (or add a > >> >> special-case to the ufunc dispatch machinery) that checks the > >> >> missingness value and then dispatches to the ordinary ufunc handler > >> >> for the wrapped dtype. > >> > > >> > The 'dtype factory' idea builds on the way I've structured datetime as > a > >> > parameterized type, but the thing that kills it for me is the > alignment > >> > problems of 'x.itemsize + 1'. Having the mask in a separate memory > block > >> > is > >> > a lot better than having to store 16 bytes for an 8-byte int to > preserve > >> > the > >> > alignment. > >> > >> Hmm. I'm not convinced that this is the best approach either, but let > >> me play devil's advocate. > >> > >> The disadvantage of this approach is that masked arrays would > >> effectively have a 100% memory overhead over regular arrays, as > >> opposed to the "shadow mask" approach where the memory overhead is > >> 12.5%--100% depending on the size of objects being stored. Probably > >> the most common case is arrays of doubles, in which case it's 100% > >> versus 12.5%. So that sucks. > >> > >> But on the other hand, we gain: > >> -- simpler implementation: no need to be checking and tracking the > >> mask buffer everywhere. The needed infrastructure is already built in. > > > > I don't believe this is true. The dtype mechanism would need a lot of > work > > to build that needed infrastructure first. The analysis I've done so far > > indicates the masked approach will give a simpler/cleaner implementation. > > > >> > >> -- simpler conceptually: we already have the dtype concept, it's a > >> very powerful and we use it for all sorts of things; using it here too > >> plays to our strengths. We already know what a numpy scalar is and how > >> it works. Everyone already understands how assigning a value to an > >> element of an array works, how it interacts with broadcasting, etc., > >> etc., and in this model, that's all a missing value is -- just another > >> value. > > > > From Python, this aspect of things would be virtually identical between > the > > two mechanisms. The dtype approach would require more coding and overhead > > where you have to create copies of your data to convert it into the > > parameterized "NA[int32]" dtype, versus with the masked approach where > you > > say x.flags.hasmask = True or something like that without copying the > data. 
> > > >> > >> -- it composes better with existing functionality: for example, > >> someone mentioned the distinction between a missing field inside a > >> record versus a missing record. In this model, that would just be the > >> difference between dtype([("x", maybe(float))]) and maybe(dtype([("x", > >> float)])). > > > > Indeed, the difference between an "NA[:x:f4, :y:f4]" versus ":x:NA[f4], > > :y:NA[f4]" can't be expressed the way I've designed the mask > functionality. > > (Note, this struct dtype string representation isn't actually supported > in > > NumPy.) > >> > >> Optimization is important and all, but so is simplicity and > >> robustness. That's why we're using Python in the first place :-). > >> > >> If we think that the memory overhead for floating point types is too > >> high, it would be easy to add a special case where maybe(float) used a > >> distinguished NaN instead of a separate boolean. The extra complexity > >> would be isolated to the 'maybe' dtype's inner loop functions, and > >> transparent to the Python level. (Implementing a similar optimization > >> for the masking approach would be really nasty.) This would change the > >> overhead comparison to 0% versus 12.5% in favor of the dtype approach. > > > > Yeah, there would be no such optimization for the masked approach. If > > someone really wants this, they are not precluded from also implementing > > their own "nafloat" dtype which operates independently of the masking > > mechanism. > > -Mark > > > >> > >> -- Nathaniel > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > I don't have enough time to engage in this discussion as I'd like but > I'll give my input. > > I've spent a very amount of time in pandas trying to craft a sensible > and performant missing-data-handling solution giving existing tools > which does not get in the user's way and which also works for > non-floating point data. The result works but isn't completely 100% > satisfactory, and I went by the Zen of Python in that "practicality > beats purity". If anyone's interested, have a pore through the pandas > unit tests for lots of exceptional cases and examples. > > About 38 months ago when I started writing the library now called > pandas I examined numpy.ma and friends and decided that > > a) the performance overhead for floating point data was not acceptable > b) numpy.ma does too much for the needs of financial applications, > say, or in mimicing R's NA functionality (part of why perf suffers) > c) masked arrays are difficult (imho) for non-expert users to use > effectively. In my experience, it gets in your way, and subclassing is > largely to blame for this (along with the mask field and the myriad > mask-related functions). It's very "complete" from a technical purity > / computer science-y standpoint by practicality is traded off (re: Zen > of Python). In R many functions have a flag to handle NA's like na.rm > and there is the is.na function, along with a few other NA-handling > functions, and that's it. I wasn't willing to teach my colleagues / > users of pandas how to use masked arrays (or scikits.timeseries, > because of numpy.ma reliance) for this reason. 
I believe that this has > overall been the right decision. > Having a more R-like interface to the underlying masked implementation is what I'm leaning towards, with the mask being an implementation detail which is still accessible, but not at centre stage. So whatever solution you come up with, you need to dogfood it if > possible with users who are only at a beginning-to-intermediate level > of NumPy or Python expertise. Does it get in the way? Does it require > constant tinkering with masks (if there is a boolean mask versus a > special NA value)? Intuitive? Hopefully I can take whatever result > comes of this development effort and change pandas to be implemented > on top of it without changing the existing API / behavior in any > significant ways. If I cannot, I will be (very, very) sad. > I'll definitely try to do this as best I can, -Mark > (I don't mean to be overly critical of numpy.ma-- I just care about > solving problems and making the tools as easy-to-use and intuitive as > possible.) > > - Wes > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 15:38:14 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 14:38:14 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <4E04834F.6060708@gmail.com> <7CE5BD47-5CD7-44B3-9C9C-FB78D338295A@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 1:04 PM, Matthew Brett wrote: > Hi, > > Just as a use case, if I do this: > > a = np.zeros((big_number,), dtype=np.int32) > a[0,0] = np.NA > > I think I'm right in saying that, with the array.mask implementation > my array memory usage with grow by new big_number bytes, whereas with > the np.naint32 implementation you'd get something like: > > Error('This data type does not allow missing values') > > Is that right? > Not really, I much prefer having the operation of adding a mask always be very explicit. It should raise an exception along the lines of "Cannot assign the NA missing value to an array with no validity mask". -Mark > See y'all, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srean.list at gmail.com Fri Jun 24 15:38:18 2011 From: srean.list at gmail.com (srean) Date: Fri, 24 Jun 2011 14:38:18 -0500 Subject: [Numpy-discussion] How to avoid extra copying when forming an array from an iterator In-Reply-To: References: Message-ID: A valiant exercise in hope: Is this possible to do it without a loop or extra copying. What I have is an iterator that yields a fixed with string on every call to next(). Now I want to create a numpy array of ints out of the last 4 chars of that string. My plan was to pass the iterator through a generator that returned an iterator over the last 4 chars. (sub question: given that strings are immutable, is it possible to yield a view of the last 4 chars rather than a copy). Then apply StringIO.writelines() on the 2-char iterator returned. After its done, create a numpy.array from the StringIO's buffer. This does not work, the other option is to use an array.array in place of a StringIO object. 
But is it possible to fill an array.array using a lazy iterator without an explicit loop in python. Something like the writelines() call I know premature optimization and all that, but this indeed needs to be done efficiently Thanks again for your gracious help -- srean On Fri, Jun 24, 2011 at 9:12 AM, Robert Kern wrote: > On Fri, Jun 24, 2011 at 04:03, srean wrote: > > To answer my own question, I guess I can keep appending to a > array.array() > > object and get a numpy.array from its buffer if possible. Is that the > > efficient way. > > It's one of the most efficient ways to do it, yes, especially for 1D > arrays. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 15:46:41 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 14:46:41 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 1:18 PM, Matthew Brett wrote: > Hi, > > On Fri, Jun 24, 2011 at 5:45 PM, Mark Wiebe wrote: > > On Fri, Jun 24, 2011 at 6:59 AM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith wrote: > ... > >> and the fact that 'missing_value' could be any type would make the > >> code more complicated than the current case where the mask is always > >> bools or something? > > > > I'm referring to the underlying C implementations of the dtypes and any > > additional custom dtypes that people create. With the masked approach, > you > > implement a new custom data type in C, and it automatically works with > > missing data. With the custom dtype approach, you have to do a lot more > > error-prone work to handle the special values in all the ufuncs. > > This is just pure ignorance on my part, but I can see that the ufuncs > need to handle the missing values, but I can't see immediately why > that will be much more complicated than the 'every array might have a > mask' implementation. This was what I was trying to say with my silly > sketch: > > missing_value = np.dtype.missing_value > > for e in oned_array: > if e == missing_value: > > well - you get the idea. Obviously this is what you've been thinking > about, I just wanted to get a grasp of where the extra complexity is > coming from compared to: > > for i, e in enumerate(one_d_array): > if one_d_array.mask[i] == False: > With the masked approach, I plan to add a default mask support mechanism to the ufunc which calls the unmasked loop on the chunks that it can, then give the ability for a specific ufunc to provide faster mask-aware inner loops. Doing this in a general way for individual na* dtypes would require more special cases to handle different dtype sizes and specifics. -Mark > > Cheers, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmp50 at ukr.net Fri Jun 24 15:55:09 2011 From: tmp50 at ukr.net (Dmitrey) Date: Fri, 24 Jun 2011 22:55:09 +0300 Subject: [Numpy-discussion] [ANN] Numerical integration with guaranteed precision by interalg Message-ID: Hi all, some ideas implemented in the solver interalg (INTERval ALGorithm) that already turn out to be more effective than its competitors in numerical optimization (benchmark) appears to be extremely effective in numerical integration with guaranteed precision. 
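(For readers unfamiliar with the scipy.integrate routines being compared against, the basic calls look like the sketch below; the integrand is an arbitrary example, not one of the benchmark problems.)

import numpy as np
from scipy.integrate import quad, dblquad

# 1-D: quad returns (value, estimated absolute error).
val, err = quad(lambda x: np.exp(-x**2), 0.0, 1.0)

# 2-D: dblquad integrates func(y, x) over x in [a, b], y in [gfun(x), hfun(x)].
val2, err2 = dblquad(lambda y, x: np.exp(-x**2 - y**2),
                     0.0, 1.0,
                     lambda x: 0.0,
                     lambda x: 1.0)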
Here are some examples where interalg works perfectly while scipy.integrate solvers fail to solve the problems and lie about obtained residual: * 1-D (vs scipy.integrate quad) > * 2-D (vs scipy.integrate dblquad) > * 3-D (vs scipy.integrate tplquad) > see http://openopt.org/IP for more details. Regards, D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Fri Jun 24 16:07:44 2011 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jun 2011 15:07:44 -0500 Subject: [Numpy-discussion] How to avoid extra copying when forming an array from an iterator In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 14:38, srean wrote: > A valiant exercise in hope: > > Is this possible to do it without a loop or extra copying. What I have is an > iterator that yields a fixed with string on every call to next(). Now I want > to create a numpy array of ints out of the last 4 chars of that string. > > My plan was to pass the iterator through a generator that returned an > iterator over the last 4 chars. (sub question: given that strings are > immutable, is it possible to yield a view of the last 4 chars rather than a > copy). Yes, but there isn't much point to it. > Then apply StringIO.writelines() on the 2-char iterator returned. > After its done, create a numpy.array from the StringIO's buffer. > > This does not work, the other option is to use an array.array in place of a > StringIO object. But is it possible to fill an array.array using a lazy > iterator without an explicit loop in python. Something like the writelines() > call Your generator that is yielding the last four bytes of the string fragments is already a Python loop. Adding another is not much additional overhead. But you can also be clever a = array.array('c') map(a.extend, your_generator_of_4strings) b = np.frombuffer(a, dtype=np.int32) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From xscript at gmx.net Fri Jun 24 16:38:26 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Fri, 24 Jun 2011 22:38:26 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: (Mark Wiebe's message of "Thu, 23 Jun 2011 19:34:11 -0500") References: Message-ID: <87boxnhsn1.fsf@fulla.xlab.taz> Mark Wiebe writes: > It's should also be possible to accomplish a general > solution at the dtype level. We could have a 'dtype > factory' used like: ?np.zeros(10, dtype=np.maybe(float)) > where np.maybe(x) returns a new dtype whose storage size > is x.itemsize + 1, where the extra byte is used to store > missingness information. (There might be some annoying > alignment issues to deal with.) Then for each ufunc we > define a handler for the maybe dtype (or add a > special-case to the ufunc dispatch machinery) that checks > the missingness value and then dispatches to the ordinary > ufunc handler for the wrapped dtype. > The 'dtype factory' idea builds on the way I've structured > datetime as a parameterized type, but the thing that kills it > for me is the alignment problems of 'x.itemsize + 1'. Having > the mask in a separate memory block is a lot better than > having to store 16 bytes for an 8-byte int to preserve the > alignment. > Yes, but that assumes it is appended to the existing types in the > dtype individually instead of the dtype as a whole. 
The dtype with > mask could just indicate a shadow array, an alpha channel if you > will, that is essentially what you are already doing but just > probide a different place to track it. > This would seem to change the definition of a dtype - currently it > represents a contiguous block of memory. It doesn't need to use all of > that memory, but the dtype conceptually owns it. I kind of like it > that way, where the whole strides idea with data being all over memory > space belonging to ndarray, not dtype. I don't havy any knowledge on the numpy or ma internals, so this might well be nonsense. Increasing the dtype item size would certainly decrease performance when using big structures, as it will require higher memory bandwidth. Why not use structured arrays? (assuming each struct element has indeed its own buffer, otherwise it's the same as having a "bigger" dtype) Then you can have some "blessed" struct elements, like the mask, which influence on how to print the array or how other struct elements must be operated. Besides, using "blessed" struct elements falls in line with the recent "_ufunc_wrapper_" proposal. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From ben.root at ou.edu Fri Jun 24 17:09:16 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 24 Jun 2011 16:09:16 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:40 AM, Mark Wiebe wrote: > On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root wrote: > >> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: >> >>> Sorry y'all, I'm just commenting bits by bits: >>> >>> "One key problem is a lack of orthogonality with other features, for >>> instance creating a masked array with physical quantities can't be done >>> because both are separate subclasses of ndarray. The only reasonable way to >>> deal with this is to move the mask into the core ndarray." >>> >>> Meh. I did try to make it easy to use masked arrays on top of subclasses. >>> There's even some tests in the suite to that effect (test_subclassing). I'm >>> not buying the argument. >>> About moving mask in the core ndarray: I had suggested back in the days >>> to have a mask flag/property built-in ndarrays (which would *really* have >>> simplified the game), but this suggestion was dismissed very quickly as >>> adding too much overload. I had to agree. I'm just a tad surprised the wind >>> has changed on that matter. >>> >>> >>> "In the current masked array, calculations are done for the whole array, >>> then masks are patched up afterwords. This means that invalid calculations >>> sitting in masked elements can raise warnings or exceptions even though they >>> shouldn't, so the ufunc error handling mechanism can't be relied on." >>> >>> Well, there's a reason for that. Initially, I tried to guess what the >>> mask of the output should be from the mask of the inputs, the objective >>> being to avoid getting NaNs in the C array. That was easy in most cases, >>> but it turned out it wasn't always possible (the `power` one caused me a >>> lot of issues, if I recall correctly). So, for performance issues (to avoid >>> a lot of expensive tests), I fell back on the old concept of "compute them >>> all, they'll be sorted afterwards". 
>>> Of course, that's rather clumsy an approach. But it works not too badly >>> when in pure Python. No doubt that a proper C implementation would work >>> faster. >>> Oh, about using NaNs for invalid data ? Well, can't work with integers. >>> >>> `mask` property: >>> Nothing to add to it. It's basically what we have now (except for the >>> opposite convention). >>> >>> Working with masked values: >>> I recall some strong points back in the days for not using None to >>> represent missing values... >>> Adding a maskedstr argument to array2string ? Mmh... I prefer a global >>> flag like we have now. >>> >>> Design questions: >>> Adding `masked` or whatever we call it to a number/array should result is >>> masked/a fully masked array, period. That way, we can have an idea that >>> something was wrong with the initial dataset. >>> hardmask: I never used the feature myself. I wonder if anyone did. Still, >>> it's a nice idea... >>> >> >> As a heavy masked_array user, I regret not being able to participate more >> in this discussion as I am madly cranking out matplotlib code. I would like >> to say that I have always seen masked arrays as being the "next step up" >> from using arrays with NaNs. The hardmask/softmask/sharedmasked concepts >> are powerful, and I don't think they have yet to be exploited to their >> fullest potential. >> > > Do you have some examples where hardmask or sharedmask are being used? I > like the idea of using a hardmask array as the return value for boolean > indexing, but some more use cases would be nice. > > At one point I did have something for soft/hard masks, but I think my final implementation went a different direction. I would have to look around. I do have a good use-case for soft masks. For a given data, I wanted to produce several pcolors highlighting different regions. A soft mask provided me a quick-n-easy way to change the mask without having to produce many copies of the original data. > Masks are (relatively) easy when dealing with element-by-element operations >> that produces an array of the same shape (or at least the same number of >> elements in the case of reshape and transpose). What gets difficult is for >> reductions such as sum or max, etc. Then you get into the weirder cases >> such as unwrap and gradients that I brought up recently. I am not sure how >> to address this, but I am not a fan of the idea of adding yet another >> parameter to the ufuncs to determine what to do for filling in a mask. >> > > It looks like in R there is a parameter called na.rm=T/F, which basically > means "remove NAs before doing the computation". This approach seems good to > me for reduction operations. > > Just to throw out some examples where these settings really do not make much sense. For gradients and unwrap, maybe you want to skip na's, but still record the number of points you are skipping or maybe the points at na-boundaries become na's themselves. Are we going to have something for each one of these possibilities? Of course, this isn't even very well dealt with in masked arrays right now. Another example of how we use masks in matplotlib is in pcolor(). We have to combine the possible masks of X, Y, and V in both the x and y directions to find the final mask to use for the final output result (because each facet needs valid data at each corner). 
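A minimal sketch of that corner-mask combination, with made-up shapes and variable names (this is not the actual matplotlib code):

import numpy as np

X = np.random.rand(5, 6)      # hypothetical grid corners
Y = np.random.rand(5, 6)
V = np.random.rand(4, 5)      # hypothetical cell values
X[1, 2] = np.nan              # one bad corner

Xm = np.ma.masked_invalid(X)
Ym = np.ma.masked_invalid(Y)
Vm = np.ma.masked_invalid(V)

# Combine the corner masks of X and Y.
corner_bad = np.ma.getmaskarray(Xm) | np.ma.getmaskarray(Ym)

# A facet is valid only if all four of its corners are valid.
facet_bad = (corner_bad[:-1, :-1] | corner_bad[:-1, 1:] |
             corner_bad[1:, :-1] | corner_bad[1:, 1:])

# Temporary mask for plotting; the data in V is not copied.
C = np.ma.array(Vm, mask=facet_bad | np.ma.getmaskarray(Vm), copy=False)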
Having a soft-mask implementation allows one to create a temporary mask to use for the operation, and to share that mask across all the input data, but then let the data structures retain their original masks when done. > Also, just to make things messier, there is an incomplete feature that was >> made for record arrays with regards to masking. The idea was to allow for >> element-by-element masking, but also allow for row-by-row (or was it >> column-by-column?) masking. I thought it was a neat feature, and it is too >> bad that it was not finished. >> > > I put this in my design, I think this would be useful too. I would call it > field by field, though many people like thinking of the struct dtype fields > as columns. > > Fields are fine. I have found that there is no real consistency with how professionals refer to their rows and columns as "records" and "fields". I learned data-handling from working on databases, but my naming convention often clashes with my some of my committee members who come from a stats background. > Anyway, my opinion is that a mask should be True for a value that needs >> to be hidden. Do not change this convention. People coming into python >> already has to change code, a simple bit flip for them should be fine. >> Breaking existing python code is worse. >> > > I'm now thinking the mask needs to be pushed away into the background to > where it becomes be an unimportant implementation detail of the system. It > deserves a long cumbersome name like "validitymask", and then the system can > use something close R's approach with an NA-like singleton for most > operations. > Don't lose sight that we are really talking about two orthogonal (albeit, seemingly similar) concepts. "missing" data and "ambiguous" data. Both of these tools need to be at the forefront and the distinction needs to be made clear to the users so that they know which one they need in what situation. I think hiding masks is a bad idea. I want numpy to be *better* than R by offering both features in a clear, non-conflicting manner. On a note somewhat similar to what I pointing out earlier with regards to soft masks. One thing that is very nice about masked_arrays is that I can at any time turn a regular numpy array into a masked array without paying a penalty of having to re-assign the data. Just need to make a separate mask object. This is different from how one would operate with a na-dtype approach, where converting an array with a regular dtype into a na-dtype array would require a copy. However, with proper dtype-handling, this may not be of much concern (non-na-dtype + na-dtype --> na-dtype, much like how int + float --> float). Also loading functions could be told to cast to a na-dtype, which would then result in an array that is ready "out-of-the-box" as opposed to casting the masked array after the creation of the regular ndarray from a function like np.loadtxt(). Again, there are pros and cons either way and I see them very orthogonal and complementary. Heck, I could even imagine situations where one might want a mask over an array with a na-dtype. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... 
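A short illustration of the no-copy point above: np.ma.array defaults to copy=False, so wrapping an existing ndarray in a masked array reuses the same buffer (the values here are just an example):

import numpy as np

data = np.arange(10, dtype=np.float64)
m = np.ma.array(data, mask=(data % 3 == 0))   # no copy of `data` is made

m[5] = np.ma.masked     # hides the value; the buffer still holds 5.0 underneath
data[4] = 99.0          # edits to the plain array show through: m[4] is now 99.0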
URL: From njs at pobox.com Fri Jun 24 17:24:11 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 14:24:11 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 12:26 PM, Mark Wiebe wrote: > For the maybe dtype, it would need to gain access to the ufunc loop of the > underlying dtype, and call it appropriately during the inner loop. This > appears to require some more invasive upheaval within the ufunc code than > the masking approach. Not really -- it can just set up some shim ndarray structures, and then use the standard top-level ufunc entry-point. (Well, this does assume that ufunc machinery is re-entrant. If it uses global variables or something to hold per-call state then we might be in trouble.) -- Nathaniel From matthew.brett at gmail.com Fri Jun 24 18:21:45 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 24 Jun 2011 23:21:45 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root wrote: ... > Again, there are pros and cons either way and I see them very orthogonal and > complementary. That may be true, but I imagine only one of them will be implemented. @Mark - I don't have a clear idea whether you consider the nafloat64 option to be still in play as the first thing to be implemented (before array.mask). If it is, what kind of thing would persuade you either way? See you, Matthew From gael.varoquaux at normalesup.org Fri Jun 24 18:58:32 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 25 Jun 2011 00:58:32 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <20110624225832.GE3847@phare.normalesup.org> On Thu, Jun 23, 2011 at 07:51:25PM -0400, josef.pktd at gmail.com wrote: > From the perspective of statistical analysis, I don't see much > advantage of this. What to do with nans depends on the analysis, and > needs to be looked at for each case. >From someone who actually sometimes does statistics with missing data, I would like to echo this feeling. The hard problem is to write your statistical code (estimator, or test statistics) in such a way that it can deal with missing data. The actually container/marker to store the data is quite minor. G From charlesr.harris at gmail.com Fri Jun 24 19:10:26 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 17:10:26 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett wrote: > Hi, > > On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root wrote: > ... > > Again, there are pros and cons either way and I see them very orthogonal > and > > complementary. > > That may be true, but I imagine only one of them will be implemented. > > @Mark - I don't have a clear idea whether you consider the nafloat64 > option to be still in play as the first thing to be implemented > (before array.mask). If it is, what kind of thing would persuade you > either way? > > Mark can speak for himself, but I think things are tending towards masks. 
They have the advantage of one implementation for all data types, current and future, and they are more flexible since the masked data can be actual valid data that you just choose to ignore for experimental reasons. What might be helpful is a routine to import/export R files, but that shouldn't be to difficult to implement. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jun 24 19:22:11 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 24 Jun 2011 19:22:11 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris wrote: > > > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett > wrote: >> >> Hi, >> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root wrote: >> ... >> > Again, there are pros and cons either way and I see them very orthogonal >> > and >> > complementary. >> >> That may be true, but I imagine only one of them will be implemented. >> >> @Mark - I don't have a clear idea whether you consider the nafloat64 >> option to be still in play as the first thing to be implemented >> (before array.mask). ? If it is, what kind of thing would persuade you >> either way? >> > > Mark can speak for himself,? but I think things are tending towards masks. > They have the advantage of one implementation for all data types, current > and future, and they are more flexible since the masked data can be actual > valid data that you just choose to ignore for experimental? reasons. > > What might be helpful is a routine to import/export R files, but that > shouldn't be to difficult to implement. > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > Perhaps we should make a wiki page someplace summarizing pros and cons of the various implementation approaches? I worry very seriously about adding API functions relating to masks rather than having special NA values which propagate in algorithms. The question is: will Joe Blow Former R user have to understand what is the mask and how to work with it? If the answer is yes we have a problem. If it can be completely hidden as an implementation detail, that's great. In R NAs are just sort of inherent-- they propagate you deal with them when you have to via na.rm flag in functions or is.na. The other problem I can think of with masks is the extra memory footprint, though maybe this is no cause for concern. -W From matthew.brett at gmail.com Fri Jun 24 20:02:07 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 01:02:07 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney wrote: ... > Perhaps we should make a wiki page someplace summarizing pros and cons > of the various implementation approaches? But - we should do this if it really is an open question which one we go for. If not then, we're just slowing Mark down in getting to the implementation. 
Assuming the question is still open, here's a starter for the pros and cons: array.mask 1) It's easier / neater to implement 2) It can generalize across dtypes 3) You can still get the masked data underneath the mask (allowing you to unmask etc) nafloat64: 1) No memory overhead 2) Battle-tested implementation already done in R I guess we'd have to test directly whether the non-continuous memory of the mask and data would cause enough cache-miss problems to outweigh the potential cycle-savings from single byte comparisons in array.mask. I guess that one and only one of these will get written. I guess that one of these choices may be a lot more satisfying to the current and future masked array itch than the other. I'm personally worried that the memory overhead of array.masks will make many of us tend to avoid them. I work with images that can easily get large enough that I would not want an array-items size byte array added to my storage. The reason I'm asking for more details about the implementation is because that is most of the argument for array.mask at the moment (1 and 2 above). See you, Matthew From charlesr.harris at gmail.com Fri Jun 24 20:02:22 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 18:02:22 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney wrote: > On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris > wrote: > > > > > > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root > wrote: > >> ... > >> > Again, there are pros and cons either way and I see them very > orthogonal > >> > and > >> > complementary. > >> > >> That may be true, but I imagine only one of them will be implemented. > >> > >> @Mark - I don't have a clear idea whether you consider the nafloat64 > >> option to be still in play as the first thing to be implemented > >> (before array.mask). If it is, what kind of thing would persuade you > >> either way? > >> > > > > Mark can speak for himself, but I think things are tending towards > masks. > > They have the advantage of one implementation for all data types, current > > and future, and they are more flexible since the masked data can be > actual > > valid data that you just choose to ignore for experimental reasons. > > > > What might be helpful is a routine to import/export R files, but that > > shouldn't be to difficult to implement. > > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > Perhaps we should make a wiki page someplace summarizing pros and cons > of the various implementation approaches? I worry very seriously about > adding API functions relating to masks rather than having special NA > values which propagate in algorithms. The question is: will Joe Blow > Former R user have to understand what is the mask and how to work with > it? If the answer is yes we have a problem. If it can be completely > hidden as an implementation detail, that's great. In R NAs are just > sort of inherent-- they propagate you deal with them when you have to > via na.rm flag in functions or is.na. > > Well, I think both of those can be pretty transparent. Could you illustrate some typical R usage, to wit. 
1) setting a value to na 2) checking a value for na Other things are problematic, like checking for integer overflow. For safety that would be desireable, for speed not. I think that is a separate question however. In any case, if we do check such things we should be able to set the corresponding mask value in the loop, and I suppose that is the sort of thing you want. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jun 24 20:11:46 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 24 Jun 2011 20:11:46 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:02 PM, Charles R Harris wrote: > > > On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney wrote: >> >> On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris >> wrote: >> > >> > >> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett >> > wrote: >> >> >> >> Hi, >> >> >> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root >> >> wrote: >> >> ... >> >> > Again, there are pros and cons either way and I see them very >> >> > orthogonal >> >> > and >> >> > complementary. >> >> >> >> That may be true, but I imagine only one of them will be implemented. >> >> >> >> @Mark - I don't have a clear idea whether you consider the nafloat64 >> >> option to be still in play as the first thing to be implemented >> >> (before array.mask). ? If it is, what kind of thing would persuade you >> >> either way? >> >> >> > >> > Mark can speak for himself,? but I think things are tending towards >> > masks. >> > They have the advantage of one implementation for all data types, >> > current >> > and future, and they are more flexible since the masked data can be >> > actual >> > valid data that you just choose to ignore for experimental? reasons. >> > >> > What might be helpful is a routine to import/export R files, but that >> > shouldn't be to difficult to implement. >> > >> > Chuck >> > >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> > >> >> Perhaps we should make a wiki page someplace summarizing pros and cons >> of the various implementation approaches? I worry very seriously about >> adding API functions relating to masks rather than having special NA >> values which propagate in algorithms. The question is: will Joe Blow >> Former R user have to understand what is the mask and how to work with >> it? If the answer is yes we have a problem. If it can be completely >> hidden as an implementation detail, that's great. In R NAs are just >> sort of inherent-- they propagate you deal with them when you have to >> via na.rm flag in functions or is.na. >> > > Well, I think both of those can be pretty transparent. Could you illustrate > some typical R usage, to wit. > > 1) setting a value to na > 2) checking a value for na > > Other things are problematic, like checking for integer overflow. For safety > that would be desireable, for speed not. I think that is a separate question > however. In any case, if we do check such things we should be able to set > the corresponding mask value in the loop, and I suppose that is the sort of > thing you want. 
> > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > I think anyone making decisions about this needs to have a pretty good understanding of what R does. So here's some examples but you guys really need to spend some time with R if you have not already arr <- rnorm(20) arr [1] 1.341960278 0.757033314 -0.910468762 -0.475811935 -0.007973053 [6] 1.618201117 -0.965747088 0.386811224 0.229158237 0.987050613 [11] 1.293453170 -2.432399045 -0.247593481 -0.639769586 -0.464996583 [16] 0.720181047 0.846607030 0.486173088 -0.911247626 0.370326788 arr[5:10] = NA arr [1] 1.3419603 0.7570333 -0.9104688 -0.4758119 NA NA [7] NA NA NA NA 1.2934532 -2.4323990 [13] -0.2475935 -0.6397696 -0.4649966 0.7201810 0.8466070 0.4861731 [19] -0.9112476 0.3703268 is.na(arr) [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE mean(arr) [1] NA mean(arr, na.rm=T) [1] -0.01903945 arr + rnorm(20) [1] 2.081580297 0.505050028 -0.696287035 -1.280323279 NA [6] NA NA NA NA NA [11] 2.166078369 -1.445271291 0.764894624 0.795890929 0.549621207 [16] 0.005215596 -0.170001426 0.712335355 -0.919671745 -0.617099818 and obviously this is OK too: arr <- rep('wes', 10) arr[5:7] <- NA is.na(arr) [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE note, NA gets excluded from categorical variables (factors): as.factor(arr) [1] wes wes wes wes wes wes wes Levels: wes e.g. groupby with NA: > tapply(rnorm(10), arr, mean) wes -0.5271853 From mwwiebe at gmail.com Fri Jun 24 20:17:02 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 19:17:02 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <87boxnhsn1.fsf@fulla.xlab.taz> References: <87boxnhsn1.fsf@fulla.xlab.taz> Message-ID: On Fri, Jun 24, 2011 at 3:38 PM, Llu?s wrote: > Mark Wiebe writes: > > > It's should also be possible to accomplish a general > > solution at the dtype level. We could have a 'dtype > > factory' used like: np.zeros(10, dtype=np.maybe(float)) > > where np.maybe(x) returns a new dtype whose storage size > > is x.itemsize + 1, where the extra byte is used to store > > missingness information. (There might be some annoying > > alignment issues to deal with.) Then for each ufunc we > > define a handler for the maybe dtype (or add a > > special-case to the ufunc dispatch machinery) that checks > > the missingness value and then dispatches to the ordinary > > ufunc handler for the wrapped dtype. > > > > The 'dtype factory' idea builds on the way I've structured > > datetime as a parameterized type, but the thing that kills it > > for me is the alignment problems of 'x.itemsize + 1'. Having > > the mask in a separate memory block is a lot better than > > having to store 16 bytes for an 8-byte int to preserve the > > alignment. > > > > Yes, but that assumes it is appended to the existing types in the > > dtype individually instead of the dtype as a whole. The dtype with > > mask could just indicate a shadow array, an alpha channel if you > > will, that is essentially what you are already doing but just > > probide a different place to track it. > > > > This would seem to change the definition of a dtype - currently it > > represents a contiguous block of memory. It doesn't need to use all of > > that memory, but the dtype conceptually owns it. 
I kind of like it > > that way, where the whole strides idea with data being all over memory > > space belonging to ndarray, not dtype. > > I don't havy any knowledge on the numpy or ma internals, so this might > well be nonsense. > > Increasing the dtype item size would certainly decrease performance when > using big structures, as it will require higher memory bandwidth. > > Why not use structured arrays? (assuming each struct element has indeed > its own buffer, otherwise it's the same as having a "bigger" dtype) Then > you can have some "blessed" struct elements, like the mask, which > influence on how to print the array or how other struct elements must be > operated. > Structured arrays do put their fields next to each other in memory, so this is basically like having a bigger dtype. -Mark > Besides, using "blessed" struct elements falls in line with the recent > "_ufunc_wrapper_" proposal. > > > > Lluis > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 24 20:24:23 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 18:24:23 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 6:11 PM, Wes McKinney wrote: > On Fri, Jun 24, 2011 at 8:02 PM, Charles R Harris > wrote: > > > > > > On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney > wrote: > >> > >> On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett < > matthew.brett at gmail.com> > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root > >> >> wrote: > >> >> ... > >> >> > Again, there are pros and cons either way and I see them very > >> >> > orthogonal > >> >> > and > >> >> > complementary. > >> >> > >> >> That may be true, but I imagine only one of them will be implemented. > >> >> > >> >> @Mark - I don't have a clear idea whether you consider the nafloat64 > >> >> option to be still in play as the first thing to be implemented > >> >> (before array.mask). If it is, what kind of thing would persuade > you > >> >> either way? > >> >> > >> > > >> > Mark can speak for himself, but I think things are tending towards > >> > masks. > >> > They have the advantage of one implementation for all data types, > >> > current > >> > and future, and they are more flexible since the masked data can be > >> > actual > >> > valid data that you just choose to ignore for experimental reasons. > >> > > >> > What might be helpful is a routine to import/export R files, but that > >> > shouldn't be to difficult to implement. > >> > > >> > Chuck > >> > > >> > > >> > _______________________________________________ > >> > NumPy-Discussion mailing list > >> > NumPy-Discussion at scipy.org > >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > > >> > > >> > >> Perhaps we should make a wiki page someplace summarizing pros and cons > >> of the various implementation approaches? 
I worry very seriously about > >> adding API functions relating to masks rather than having special NA > >> values which propagate in algorithms. The question is: will Joe Blow > >> Former R user have to understand what is the mask and how to work with > >> it? If the answer is yes we have a problem. If it can be completely > >> hidden as an implementation detail, that's great. In R NAs are just > >> sort of inherent-- they propagate you deal with them when you have to > >> via na.rm flag in functions or is.na. > >> > > > > Well, I think both of those can be pretty transparent. Could you > illustrate > > some typical R usage, to wit. > > > > 1) setting a value to na > > 2) checking a value for na > > > > Other things are problematic, like checking for integer overflow. For > safety > > that would be desireable, for speed not. I think that is a separate > question > > however. In any case, if we do check such things we should be able to set > > the corresponding mask value in the loop, and I suppose that is the sort > of > > thing you want. > > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > I think anyone making decisions about this needs to have a pretty good > understanding of what R does. So here's some examples but you guys > really need to spend some time with R if you have not already > > arr <- rnorm(20) > arr > [1] 1.341960278 0.757033314 -0.910468762 -0.475811935 -0.007973053 > [6] 1.618201117 -0.965747088 0.386811224 0.229158237 0.987050613 > [11] 1.293453170 -2.432399045 -0.247593481 -0.639769586 -0.464996583 > [16] 0.720181047 0.846607030 0.486173088 -0.911247626 0.370326788 > arr[5:10] = NA > arr > [1] 1.3419603 0.7570333 -0.9104688 -0.4758119 NA NA > [7] NA NA NA NA 1.2934532 -2.4323990 > [13] -0.2475935 -0.6397696 -0.4649966 0.7201810 0.8466070 0.4861731 > [19] -0.9112476 0.3703268 > is.na(arr) > [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE > FALSE > [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > mean(arr) > [1] NA > mean(arr, na.rm=T) > [1] -0.01903945 > > arr + rnorm(20) > [1] 2.081580297 0.505050028 -0.696287035 -1.280323279 NA > [6] NA NA NA NA NA > [11] 2.166078369 -1.445271291 0.764894624 0.795890929 0.549621207 > [16] 0.005215596 -0.170001426 0.712335355 -0.919671745 -0.617099818 > > and obviously this is OK too: > > arr <- rep('wes', 10) > arr[5:7] <- NA > is.na(arr) > [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE > > note, NA gets excluded from categorical variables (factors): > as.factor(arr) > [1] wes wes wes wes wes wes wes > Levels: wes > > e.g. groupby with NA: > > > tapply(rnorm(10), arr, mean) > wes > -0.5271853 > I think those are all doable. The main concerns I have at the moment are: 1) Tracking things like integer overflow, yes, no. 2) Memory. I suppose masks could be packed into bits if it came to that. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Fri Jun 24 20:39:41 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 19:39:41 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 4:09 PM, Benjamin Root wrote: > > > On Fri, Jun 24, 2011 at 10:40 AM, Mark Wiebe wrote: > >> On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root wrote: >> >>> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM wrote: >>> >>>> Sorry y'all, I'm just commenting bits by bits: >>>> >>>> "One key problem is a lack of orthogonality with other features, for >>>> instance creating a masked array with physical quantities can't be done >>>> because both are separate subclasses of ndarray. The only reasonable way to >>>> deal with this is to move the mask into the core ndarray." >>>> >>>> Meh. I did try to make it easy to use masked arrays on top of >>>> subclasses. There's even some tests in the suite to that effect >>>> (test_subclassing). I'm not buying the argument. >>>> About moving mask in the core ndarray: I had suggested back in the days >>>> to have a mask flag/property built-in ndarrays (which would *really* have >>>> simplified the game), but this suggestion was dismissed very quickly as >>>> adding too much overload. I had to agree. I'm just a tad surprised the wind >>>> has changed on that matter. >>>> >>>> >>>> "In the current masked array, calculations are done for the whole array, >>>> then masks are patched up afterwords. This means that invalid calculations >>>> sitting in masked elements can raise warnings or exceptions even though they >>>> shouldn't, so the ufunc error handling mechanism can't be relied on." >>>> >>>> Well, there's a reason for that. Initially, I tried to guess what the >>>> mask of the output should be from the mask of the inputs, the objective >>>> being to avoid getting NaNs in the C array. That was easy in most cases, >>>> but it turned out it wasn't always possible (the `power` one caused me a >>>> lot of issues, if I recall correctly). So, for performance issues (to avoid >>>> a lot of expensive tests), I fell back on the old concept of "compute them >>>> all, they'll be sorted afterwards". >>>> Of course, that's rather clumsy an approach. But it works not too badly >>>> when in pure Python. No doubt that a proper C implementation would work >>>> faster. >>>> Oh, about using NaNs for invalid data ? Well, can't work with integers. >>>> >>>> `mask` property: >>>> Nothing to add to it. It's basically what we have now (except for the >>>> opposite convention). >>>> >>>> Working with masked values: >>>> I recall some strong points back in the days for not using None to >>>> represent missing values... >>>> Adding a maskedstr argument to array2string ? Mmh... I prefer a global >>>> flag like we have now. >>>> >>>> Design questions: >>>> Adding `masked` or whatever we call it to a number/array should result >>>> is masked/a fully masked array, period. That way, we can have an idea that >>>> something was wrong with the initial dataset. >>>> hardmask: I never used the feature myself. I wonder if anyone did. >>>> Still, it's a nice idea... >>>> >>> >>> As a heavy masked_array user, I regret not being able to participate more >>> in this discussion as I am madly cranking out matplotlib code. I would like >>> to say that I have always seen masked arrays as being the "next step up" >>> from using arrays with NaNs. 
The hardmask/softmask/sharedmasked concepts >>> are powerful, and I don't think they have yet to be exploited to their >>> fullest potential. >>> >> >> Do you have some examples where hardmask or sharedmask are being used? I >> like the idea of using a hardmask array as the return value for boolean >> indexing, but some more use cases would be nice. >> >> > > At one point I did have something for soft/hard masks, but I think my final > implementation went a different direction. I would have to look around. I > do have a good use-case for soft masks. For a given data, I wanted to > produce several pcolors highlighting different regions. A soft mask > provided me a quick-n-easy way to change the mask without having to produce > many copies of the original data. > That sounds cool, matplotlib will be a good place to do test modifications while I'm doing the implementation. > Masks are (relatively) easy when dealing with element-by-element operations >>> that produces an array of the same shape (or at least the same number of >>> elements in the case of reshape and transpose). What gets difficult is for >>> reductions such as sum or max, etc. Then you get into the weirder cases >>> such as unwrap and gradients that I brought up recently. I am not sure how >>> to address this, but I am not a fan of the idea of adding yet another >>> parameter to the ufuncs to determine what to do for filling in a mask. >>> >> >> It looks like in R there is a parameter called na.rm=T/F, which basically >> means "remove NAs before doing the computation". This approach seems good to >> me for reduction operations. >> >> > Just to throw out some examples where these settings really do not make > much sense. For gradients and unwrap, maybe you want to skip na's, but > still record the number of points you are skipping or maybe the points at > na-boundaries become na's themselves. Are we going to have something for > each one of these possibilities? Of course, this isn't even very well dealt > with in masked arrays right now. > Yeah, for some functions dealing with NA values will need individual per-function care. Probably they should raise by default until NA support is implemented for them. Another example of how we use masks in matplotlib is in pcolor(). We have > to combine the possible masks of X, Y, and V in both the x and y directions > to find the final mask to use for the final output result (because each > facet needs valid data at each corner). Having a soft-mask implementation > allows one to create a temporary mask to use for the operation, and to share > that mask across all the input data, but then let the data structures retain > their original masks when done. > I will look at the implementation. > Also, just to make things messier, there is an incomplete feature that was >>> made for record arrays with regards to masking. The idea was to allow for >>> element-by-element masking, but also allow for row-by-row (or was it >>> column-by-column?) masking. I thought it was a neat feature, and it is too >>> bad that it was not finished. >>> >> >> I put this in my design, I think this would be useful too. I would call it >> field by field, though many people like thinking of the struct dtype fields >> as columns. >> >> > Fields are fine. I have found that there is no real consistency with how > professionals refer to their rows and columns as "records" and "fields". 
I > learned data-handling from working on databases, but my naming convention > often clashes with my some of my committee members who come from a stats > background. > I prefer considering them like C structs, which is why I've started calling them "struct dtypes". That name is also shorter than "structured dtypes". > Anyway, my opinion is that a mask should be True for a value that needs >>> to be hidden. Do not change this convention. People coming into python >>> already has to change code, a simple bit flip for them should be fine. >>> Breaking existing python code is worse. >>> >> >> I'm now thinking the mask needs to be pushed away into the background to >> where it becomes be an unimportant implementation detail of the system. It >> deserves a long cumbersome name like "validitymask", and then the system can >> use something close R's approach with an NA-like singleton for most >> operations. >> > > Don't lose sight that we are really talking about two orthogonal (albeit, > seemingly similar) concepts. "missing" data and "ambiguous" data. Both of > these tools need to be at the forefront and the distinction needs to be made > clear to the users so that they know which one they need in what situation. > I think hiding masks is a bad idea. I want numpy to be *better* than R by > offering both features in a clear, non-conflicting manner. > That sounds good to me, we'll have to go through several design iterations to shake out the details. On a note somewhat similar to what I pointing out earlier with regards to > soft masks. One thing that is very nice about masked_arrays is that I can > at any time turn a regular numpy array into a masked array without paying a > penalty of having to re-assign the data. Just need to make a separate mask > object. > I believe my design sufficiently allows for this. This is different from how one would operate with a na-dtype approach, where > converting an array with a regular dtype into a na-dtype array would require > a copy. However, with proper dtype-handling, this may not be of much > concern (non-na-dtype + na-dtype --> na-dtype, much like how int + float --> > float). Also loading functions could be told to cast to a na-dtype, which > would then result in an array that is ready "out-of-the-box" as opposed to > casting the masked array after the creation of the regular ndarray from a > function like np.loadtxt(). > The syntax for casting each element of a struct dtype to a new struct dtype with all na-dtypes would be clumsy at first, there's a bunch of things that would have to be figured out to make that all play nicely. > Again, there are pros and cons either way and I see them very orthogonal > and complementary. Heck, I could even imagine situations where one might > want a mask over an array with a na-dtype. > Maybe, but I'm kind of liking the idea of both of these use cases being handled by the same underlying mechanism. I've updated the NEP, and will let it bake for a bit I think. -Mark > > Ben Root > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Fri Jun 24 20:41:27 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 19:41:27 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: On Fri, Jun 24, 2011 at 4:24 PM, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 12:26 PM, Mark Wiebe wrote: > > For the maybe dtype, it would need to gain access to the ufunc loop of > the > > underlying dtype, and call it appropriately during the inner loop. This > > appears to require some more invasive upheaval within the ufunc code than > > the masking approach. > > Not really -- it can just set up some shim ndarray structures, and > then use the standard top-level ufunc entry-point. (Well, this does > assume that ufunc machinery is re-entrant. If it uses global variables > or something to hold per-call state then we might be in trouble.) > That sounds like it would be very slow to me. Once you're in the C world, especially in an inner loop, going back into the Python world is not a good idea from a performance perspective. -Mark > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 20:54:53 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 19:54:53 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 5:21 PM, Matthew Brett wrote: > Hi, > > On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root wrote: > ... > > Again, there are pros and cons either way and I see them very orthogonal > and > > complementary. > > That may be true, but I imagine only one of them will be implemented. > > @Mark - I don't have a clear idea whether you consider the nafloat64 > option to be still in play as the first thing to be implemented > (before array.mask). If it is, what kind of thing would persuade you > either way? > I'm focusing all of my effort on getting my proposal of adding a mask to the core ndarray into a state where it satisfies everyone's requirements as best I can. I'm not precluding the possibility that someone could convince me that the na-dtype is good, but I gave it a good chunk of thought before starting to write the proposal. To persuade me towards the na-dtype option, I need to be convinced that I'm solving the problem class in a generic way that works orthogonally with other features, with manageable implementation requirements, a very usable result for both strong and weak programmers, and with good performance characteristics. I think the na-dtype approach isn't as generic as I would like, and the implementation seems like it would be trickier than the masked approach. Thanks, Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Fri Jun 24 20:56:58 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 19:56:58 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 6:10 PM, Charles R Harris wrote: > > > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett wrote: > >> Hi, >> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root wrote: >> ... >> > Again, there are pros and cons either way and I see them very orthogonal >> and >> > complementary. >> >> That may be true, but I imagine only one of them will be implemented. >> >> @Mark - I don't have a clear idea whether you consider the nafloat64 >> option to be still in play as the first thing to be implemented >> (before array.mask). If it is, what kind of thing would persuade you >> either way? >> >> > Mark can speak for himself, but I think things are tending towards masks. > They have the advantage of one implementation for all data types, current > and future, and they are more flexible since the masked data can be actual > valid data that you just choose to ignore for experimental reasons. > These are all things I like about it too. > What might be helpful is a routine to import/export R files, but that > shouldn't be to difficult to implement. > There is another intern here who has started something along these lines. I'll tell him he should email the list too. ;) -Mark > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 21:00:47 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 20:00:47 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 6:22 PM, Wes McKinney wrote: > On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris > wrote: > > > > > > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root > wrote: > >> ... > >> > Again, there are pros and cons either way and I see them very > orthogonal > >> > and > >> > complementary. > >> > >> That may be true, but I imagine only one of them will be implemented. > >> > >> @Mark - I don't have a clear idea whether you consider the nafloat64 > >> option to be still in play as the first thing to be implemented > >> (before array.mask). If it is, what kind of thing would persuade you > >> either way? > >> > > > > Mark can speak for himself, but I think things are tending towards > masks. > > They have the advantage of one implementation for all data types, current > > and future, and they are more flexible since the masked data can be > actual > > valid data that you just choose to ignore for experimental reasons. > > > > What might be helpful is a routine to import/export R files, but that > > shouldn't be to difficult to implement. 
> > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > Perhaps we should make a wiki page someplace summarizing pros and cons > of the various implementation approaches? I worry very seriously about > adding API functions relating to masks rather than having special NA > values which propagate in algorithms. The question is: will Joe Blow > Former R user have to understand what is the mask and how to work with > it? If the answer is yes we have a problem. If it can be completely > hidden as an implementation detail, that's great. In R NAs are just > sort of inherent-- they propagate you deal with them when you have to > via na.rm flag in functions or is.na. > I think the interface for how it looks in NumPy can be made to be pretty close to the same with either design approach. I've updated the NEP to add and emphasize using masked values with an np.NA singleton, with the validitymask as the implementation mechanism which is still accessible for those who want to still deal with the mask directly. > The other problem I can think of with masks is the extra memory > footprint, though maybe this is no cause for concern. > The overhead is definitely worth considering, along with the extra memory traffic it generates, and I've basically concluded that the increased generality and flexibility is worth the added cost. -Mark > > -W > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Fri Jun 24 21:10:34 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Fri, 24 Jun 2011 20:10:34 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney > wrote: > ... > > Perhaps we should make a wiki page someplace summarizing pros and cons > > of the various implementation approaches? > > But - we should do this if it really is an open question which one we > go for. If not then, we're just slowing Mark down in getting to the > implementation. > > Assuming the question is still open, here's a starter for the pros and > cons: > > array.mask > 1) It's easier / neater to implement > Yes > 2) It can generalize across dtypes > Yes > 3) You can still get the masked data underneath the mask (allowing you > to unmask etc) > By setting up views appropriately, yes. If you don't have another view to the underlying data, you can't get at it. nafloat64: > 1) No memory overhead > Yes > 2) Battle-tested implementation already done in R > We can't really use that though, R is GPL and NumPy is BSD. The low-level implementation details are likely different enough that a re-implementation would be needed anyway. I guess we'd have to test directly whether the non-continuous memory > of the mask and data would cause enough cache-miss problems to > outweigh the potential cycle-savings from single byte comparisons in > array.mask. > The different memory buffers are each contiguous, so the access patterns still have a lot of coherency. I intend to give the mask memory layouts matching those of the arrays. 
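As a rough picture of what the np.NA-assignment behaviour described above would look like from the user's side, numpy.ma can already emulate it today (a sketch only; np.NA itself is still just a proposal, so ma.masked stands in for it here):

import numpy as np
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0])
a[1] = ma.masked        # stand-in for the proposed a[1] = np.NA
print(a)                # [1.0 -- 3.0]
print(a.sum())          # 4.0, the masked element is ignored
print(a.mask)           # [False  True False], the separate validity info
print(a.data[1])        # 2.0, the original value is still there underneath

That last line is the "you can still get the masked data underneath the mask" point from the pros list, something a reserved sentinel value cannot preserve by construction.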
I guess that one and only one of these will get written. I guess that > one of these choices may be a lot more satisfying to the current and > future masked array itch than the other. > I'm only going to implement one solution, yes. I'm personally worried that the memory overhead of array.masks will > make many of us tend to avoid them. I work with images that can > easily get large enough that I would not want an array-items size byte > array added to my storage. > May I ask what kind of dtypes and sizes you're working with? The reason I'm asking for more details about the implementation is > because that is most of the argument for array.mask at the moment (1 > and 2 above). > I'm first trying to nail down more of the higher level requirements before digging really deep into the implementation details. They greatly affect how those details have to turn out. -Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jun 24 21:11:58 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 18:11:58 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 2:09 PM, Benjamin Root wrote: > Another example of how we use masks in matplotlib is in pcolor().? We have > to combine the possible masks of X, Y, and V in both the x and y directions > to find the final mask to use for the final output result (because each > facet needs valid data at each corner).? Having a soft-mask implementation > allows one to create a temporary mask to use for the operation, and to share > that mask across all the input data, but then let the data structures retain > their original masks when done. This is a situation where I would just... use an array and a mask, rather than a masked array. Then lots of things -- changing fill values, temporarily masking/unmasking things, etc. -- come from free, just from knowing how arrays and boolean indexing work? Do we really get much advantage by building all these complex operations in? I worry that we're trying to anticipate and write code for every situation that users find themselves in, instead of just giving them some simple, orthogonal tools. As a corollary, I worry that learning and keeping track of how masked arrays work is more hassle than just ignoring them and writing the necessary code by hand as needed. Certainly I can imagine that *if the mask is a property of the data* then it's useful to have tools to keep it aligned with the data through indexing and such. But some of these other things are quicker to reimplement than to look up the docs for, and the reimplementation is easier to read, at least for me... (By the way, is this hard-mask/soft-mask stuff documented anywhere? I spent some time groveling over the numpy docs with google's help, and I still have no idea what it actually is.) 
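A minimal version of that plain-array-plus-boolean-mask workflow, for reference (made-up data, nothing beyond boolean indexing):

import numpy as np

data = np.array([1.0, 2.0, 7.5, 4.0])
valid = np.array([True, True, False, True])  # the "mask" is an ordinary bool array

print(data[valid].mean())            # aggregate over the valid entries only
filled = np.where(valid, data, 0.0)  # any fill value, chosen on the fly
valid[0] = False                     # temporary masking is plain assignment
print(data[valid].sum())             # 6.0, with the first entry now ignored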
-- Nathaniel From ben.root at ou.edu Fri Jun 24 21:25:47 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 24 Jun 2011 20:25:47 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:00 PM, Mark Wiebe wrote: > On Fri, Jun 24, 2011 at 6:22 PM, Wes McKinney wrote: > >> On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris >> wrote: >> > >> > >> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett > > >> > wrote: >> >> >> >> Hi, >> >> >> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root >> wrote: >> >> ... >> >> > Again, there are pros and cons either way and I see them very >> orthogonal >> >> > and >> >> > complementary. >> >> >> >> That may be true, but I imagine only one of them will be implemented. >> >> >> >> @Mark - I don't have a clear idea whether you consider the nafloat64 >> >> option to be still in play as the first thing to be implemented >> >> (before array.mask). If it is, what kind of thing would persuade you >> >> either way? >> >> >> > >> > Mark can speak for himself, but I think things are tending towards >> masks. >> > They have the advantage of one implementation for all data types, >> current >> > and future, and they are more flexible since the masked data can be >> actual >> > valid data that you just choose to ignore for experimental reasons. >> > >> > What might be helpful is a routine to import/export R files, but that >> > shouldn't be to difficult to implement. >> > >> > Chuck >> > >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> > >> >> Perhaps we should make a wiki page someplace summarizing pros and cons >> of the various implementation approaches? I worry very seriously about >> adding API functions relating to masks rather than having special NA >> values which propagate in algorithms. The question is: will Joe Blow >> Former R user have to understand what is the mask and how to work with >> it? If the answer is yes we have a problem. If it can be completely >> hidden as an implementation detail, that's great. In R NAs are just >> sort of inherent-- they propagate you deal with them when you have to >> via na.rm flag in functions or is.na. >> > > I think the interface for how it looks in NumPy can be made to be pretty > close to the same with either design approach. I've updated the NEP to add > and emphasize using masked values with an np.NA singleton, with the > validitymask as the implementation mechanism which is still accessible for > those who want to still deal with the mask directly. > I think there are a lot of benefits to this idea, if I understand it correctly. Essentially, if I were to assign np.NA to an element (or a slice) of a numpy array, rather than actually assigning that value to that spot in the array, it would set the mask to True for those elements? I see a lot of benefits to this idea. Imagine in pylab mode (from pylab import *), users would have the NA name right in their namespace, just like they are used to with R. And those who want could still mess around with masks much like we are used to. Plus, I think we can still retain the good ol' C-pointer to regular data. My question is this. Will it be a soft or hard mask? 
In other words, if I were to assign np.NA to a spot in an array, would it kill the value that was there (if it already was initialized)? Would I still be able to share masks? Admittedly, it is a munging of two distinct ideas, but, I think the end result would still be the same. > >> The other problem I can think of with masks is the extra memory >> footprint, though maybe this is no cause for concern. >> > > The overhead is definitely worth considering, along with the extra memory > traffic it generates, and I've basically concluded that the increased > generality and flexibility is worth the added cost. > > If we go with the mask approach, one could later try and optimize the implementation of the masks to reduce the memory footprint. Potentially, one could reduce the footprint by 7/8ths! Maybe some sneaky striding tricks could help keep from too much cache misses (or go with the approach Pierre mentioned which was to calculate them all and let the masks sort them out afterwards). As a complete side-thought, I wonder how sparse arrays could play into this discussion? Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Fri Jun 24 21:57:49 2011 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 24 Jun 2011 20:57:49 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 2:09 PM, Benjamin Root wrote: > > Another example of how we use masks in matplotlib is in pcolor(). We > have > > to combine the possible masks of X, Y, and V in both the x and y > directions > > to find the final mask to use for the final output result (because each > > facet needs valid data at each corner). Having a soft-mask > implementation > > allows one to create a temporary mask to use for the operation, and to > share > > that mask across all the input data, but then let the data structures > retain > > their original masks when done. > > This is a situation where I would just... use an array and a mask, > rather than a masked array. Then lots of things -- changing fill > values, temporarily masking/unmasking things, etc. -- come from free, > just from knowing how arrays and boolean indexing work? > > With a masked array, it is "for free". Why re-invent the wheel? It has already been done for me. > Do we really get much advantage by building all these complex > operations in? I worry that we're trying to anticipate and write code > for every situation that users find themselves in, instead of just > giving them some simple, orthogonal tools. > > This is the danger, and which is why I advocate retaining the MaskedArray type that would provide the high-level "intelligent" operations, meanwhile having in the core the basic data structures for pairing a mask with an array, and to recognize a special np.NA value that would act upon the mask rather than the underlying data. Users would get very basic functionality, while the MaskedArray would continue to provide the interface that we are used to. > As a corollary, I worry that learning and keeping track of how masked > arrays work is more hassle than just ignoring them and writing the > necessary code by hand as needed. Certainly I can imagine that *if the > mask is a property of the data* then it's useful to have tools to keep > it aligned with the data through indexing and such. 
But some of these > other things are quicker to reimplement than to look up the docs for, > and the reimplementation is easier to read, at least for me... > What you are advocating is similar to the "tried-n-true" coding practice of Matlab users of using NaNs. You will hear from Matlab programmers about how it is the greatest idea since sliced bread (and I was one of them). Then I was introduced to Numpy, and I while I do sometimes still do the NaN approach, I realized that the masked array is a "better" way. (By the way, is this hard-mask/soft-mask stuff documented anywhere? I > spent some time groveling over the numpy docs with google's help, and > I still have no idea what it actually is.) > As for documentation, on hard/soft masks, just look at the docs for the MaskedArray constructor: $ pydoc numpy.ma.MaskedArray numpy.ma.MaskedArray = class MaskedArray(numpy.ndarray) | An array class with possibly masked values. | | Masked values of True exclude the corresponding element from any | computation. | | Construction:: | | x = MaskedArray(data, mask=nomask, dtype=None, | copy=False, subok=True, ndmin=0, fill_value=None, | keep_mask=True, hard_mask=None, shrink=True) ... | keep_mask : bool, optional | Whether to combine `mask` with the mask of the input data, if any | (True), or to use only `mask` for the output (False). Default is True. | hard_mask : bool, optional | Whether to use a hard mask or not. With a hard mask, masked values | cannot be unmasked. Default is False. | shrink : bool, optional | Whether to force compression of an empty mask. Default is True. Later, there is a little bit of info on shared masks (although the docs could be better): | unshare_mask(self) | Copy the mask and set the sharedmask flag to False. | | Whether the mask is shared between masked arrays can be seen from | the `sharedmask` property. `unshare_mask` ensures the mask is not shared. | A copy of the mask is only made if it was shared. | | See Also | -------- | sharedmask I hope that helps, Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jun 24 23:59:53 2011 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jun 2011 20:59:53 -0700 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root wrote: > On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: >> This is a situation where I would just... use an array and a mask, >> rather than a masked array. Then lots of things -- changing fill >> values, temporarily masking/unmasking things, etc. -- come from free, >> just from knowing how arrays and boolean indexing work? > > With a masked array, it is "for free".? Why re-invent the wheel?? It has > already been done for me. But it's not for free at all. It's an additional concept that has to be maintained, documented, and learned (with the last cost, which is multiplied by the number of users, being by far the greatest). It's not reinventing the wheel, it's saying hey, I have wheels and axles, but what I really need the library to provide is a wheel+axle assembly! >> Do we really get much advantage by building all these complex >> operations in? I worry that we're trying to anticipate and write code >> for every situation that users find themselves in, instead of just >> giving them some simple, orthogonal tools. 
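To make the hard/soft distinction in the MaskedArray docstring Ben quotes above concrete, a quick check with current numpy.ma (nothing new here, just existing behaviour):

import numpy.ma as ma

soft = ma.array([1, 2, 3], mask=[False, True, False])   # soft mask is the default
soft[1] = 99               # assignment unmasks the element
print(soft)                # [1 99 3]

hard = ma.array([1, 2, 3], mask=[False, True, False], hard_mask=True)
hard[1] = 99               # a hard mask refuses to overwrite the masked entry
print(hard)                # [1 -- 3]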
>> > > This is the danger, and which is why I advocate retaining the MaskedArray > type that would provide the high-level "intelligent" operations, meanwhile > having in the core the basic data structures for? pairing a mask with an > array, and to recognize a special np.NA value that would act upon the mask > rather than the underlying data.? Users would get very basic functionality, > while the MaskedArray would continue to provide the interface that we are > used to. The interface as described is quite different... in particular, all aggregate operations would change their behavior. >> As a corollary, I worry that learning and keeping track of how masked >> arrays work is more hassle than just ignoring them and writing the >> necessary code by hand as needed. Certainly I can imagine that *if the >> mask is a property of the data* then it's useful to have tools to keep >> it aligned with the data through indexing and such. But some of these >> other things are quicker to reimplement than to look up the docs for, >> and the reimplementation is easier to read, at least for me... > > What you are advocating is similar to the "tried-n-true" coding practice of > Matlab users of using NaNs.? You will hear from Matlab programmers about how > it is the greatest idea since sliced bread (and I was one of them).? Then I > was introduced to Numpy, and I while I do sometimes still do the NaN > approach, I realized that the masked array is a "better" way. Hey, no need to go around calling people Matlab programmers, you might hurt someone's feelings. But seriously, my argument is that every abstraction and new concept has a cost, and I'm dubious that the full masked array abstraction carries its weight and justifies this cost, because it's highly redundant with existing abstractions. That has nothing to do with how tried-and-true anything is. > As for documentation, on hard/soft masks, just look at the docs for the > MaskedArray constructor: [...snipped...] Thanks! -- Nathaniel From wesmckinn at gmail.com Sat Jun 25 00:06:46 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 25 Jun 2011 00:06:46 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root wrote: >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: >>> This is a situation where I would just... use an array and a mask, >>> rather than a masked array. Then lots of things -- changing fill >>> values, temporarily masking/unmasking things, etc. -- come from free, >>> just from knowing how arrays and boolean indexing work? >> >> With a masked array, it is "for free".? Why re-invent the wheel?? It has >> already been done for me. > > But it's not for free at all. It's an additional concept that has to > be maintained, documented, and learned (with the last cost, which is > multiplied by the number of users, being by far the greatest). It's > not reinventing the wheel, it's saying hey, I have wheels and axles, > but what I really need the library to provide is a wheel+axle > assembly! You're communicating my argument better than I am. >>> Do we really get much advantage by building all these complex >>> operations in? I worry that we're trying to anticipate and write code >>> for every situation that users find themselves in, instead of just >>> giving them some simple, orthogonal tools. 
>>> >> >> This is the danger, and which is why I advocate retaining the MaskedArray >> type that would provide the high-level "intelligent" operations, meanwhile >> having in the core the basic data structures for? pairing a mask with an >> array, and to recognize a special np.NA value that would act upon the mask >> rather than the underlying data.? Users would get very basic functionality, >> while the MaskedArray would continue to provide the interface that we are >> used to. > > The interface as described is quite different... in particular, all > aggregate operations would change their behavior. > >>> As a corollary, I worry that learning and keeping track of how masked >>> arrays work is more hassle than just ignoring them and writing the >>> necessary code by hand as needed. Certainly I can imagine that *if the >>> mask is a property of the data* then it's useful to have tools to keep >>> it aligned with the data through indexing and such. But some of these >>> other things are quicker to reimplement than to look up the docs for, >>> and the reimplementation is easier to read, at least for me... >> >> What you are advocating is similar to the "tried-n-true" coding practice of >> Matlab users of using NaNs.? You will hear from Matlab programmers about how >> it is the greatest idea since sliced bread (and I was one of them).? Then I >> was introduced to Numpy, and I while I do sometimes still do the NaN >> approach, I realized that the masked array is a "better" way. > > Hey, no need to go around calling people Matlab programmers, you might > hurt someone's feelings. > > But seriously, my argument is that every abstraction and new concept > has a cost, and I'm dubious that the full masked array abstraction > carries its weight and justifies this cost, because it's highly > redundant with existing abstractions. That has nothing to do with how > tried-and-true anything is. +1. I think I will personally only be happy if "masked array" can be implemented while incurring near-zero cost from the end user perspective. If what we end up with is a faster implementation of numpy.ma in C I'm probably going to keep on using NaN... That's why I'm entirely insistent that whatever design be dogfooded on non-expert users. If it's very much harder / trickier / nuanced than R, you will have failed. >> As for documentation, on hard/soft masks, just look at the docs for the >> MaskedArray constructor: > [...snipped...] > > Thanks! > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From charlesr.harris at gmail.com Sat Jun 25 00:42:15 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 24 Jun 2011 22:42:15 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney wrote: > On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith wrote: > > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root wrote: > >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: > >>> This is a situation where I would just... use an array and a mask, > >>> rather than a masked array. Then lots of things -- changing fill > >>> values, temporarily masking/unmasking things, etc. -- come from free, > >>> just from knowing how arrays and boolean indexing work? > >> > >> With a masked array, it is "for free". 
Why re-invent the wheel? It has > >> already been done for me. > > > > But it's not for free at all. It's an additional concept that has to > > be maintained, documented, and learned (with the last cost, which is > > multiplied by the number of users, being by far the greatest). It's > > not reinventing the wheel, it's saying hey, I have wheels and axles, > > but what I really need the library to provide is a wheel+axle > > assembly! > > You're communicating my argument better than I am. > > >>> Do we really get much advantage by building all these complex > >>> operations in? I worry that we're trying to anticipate and write code > >>> for every situation that users find themselves in, instead of just > >>> giving them some simple, orthogonal tools. > >>> > >> > >> This is the danger, and which is why I advocate retaining the > MaskedArray > >> type that would provide the high-level "intelligent" operations, > meanwhile > >> having in the core the basic data structures for pairing a mask with an > >> array, and to recognize a special np.NA value that would act upon the > mask > >> rather than the underlying data. Users would get very basic > functionality, > >> while the MaskedArray would continue to provide the interface that we > are > >> used to. > > > > The interface as described is quite different... in particular, all > > aggregate operations would change their behavior. > > > >>> As a corollary, I worry that learning and keeping track of how masked > >>> arrays work is more hassle than just ignoring them and writing the > >>> necessary code by hand as needed. Certainly I can imagine that *if the > >>> mask is a property of the data* then it's useful to have tools to keep > >>> it aligned with the data through indexing and such. But some of these > >>> other things are quicker to reimplement than to look up the docs for, > >>> and the reimplementation is easier to read, at least for me... > >> > >> What you are advocating is similar to the "tried-n-true" coding practice > of > >> Matlab users of using NaNs. You will hear from Matlab programmers about > how > >> it is the greatest idea since sliced bread (and I was one of them). > Then I > >> was introduced to Numpy, and I while I do sometimes still do the NaN > >> approach, I realized that the masked array is a "better" way. > > > > Hey, no need to go around calling people Matlab programmers, you might > > hurt someone's feelings. > > > > But seriously, my argument is that every abstraction and new concept > > has a cost, and I'm dubious that the full masked array abstraction > > carries its weight and justifies this cost, because it's highly > > redundant with existing abstractions. That has nothing to do with how > > tried-and-true anything is. > > +1. I think I will personally only be happy if "masked array" can be > implemented while incurring near-zero cost from the end user > perspective. If what we end up with is a faster implementation of > numpy.ma in C I'm probably going to keep on using NaN... That's why > I'm entirely insistent that whatever design be dogfooded on non-expert > users. If it's very much harder / trickier / nuanced than R, you will > have failed. > > This sounds unduly pessimistic to me. It's one thing to suggest different approaches, another to cry doom and threaten to go eat worms. And all before the code is written, benchmarks run, or trial made of the usefulness of the approach. Let us see how things look as they get worked out. 
Mark has a good track record for innovative tools and I'm rather curious myself to see what the result is. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From lgautier at gmail.com Sat Jun 25 02:36:04 2011 From: lgautier at gmail.com (Laurent Gautier) Date: Sat, 25 Jun 2011 08:36:04 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: Message-ID: <4E0581D4.7050902@gmail.com> On 2011-06-24 17:30, Robert Kern wrote: > On Fri, Jun 24, 2011 at 10:07, Laurent Gautier wrote: >> > On 2011-06-24 16:43, Robert Kern wrote: >>> >> >>> >> On Fri, Jun 24, 2011 at 09:33, Charles R Harris >>> >> wrote: >>>> >>> >>>>> >>> > >>>>> >>> > ?On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern >>>>> >>> > ?wrote: >>>>> >>>> >>>>>>> >>>> >> ?The alternative proposal would be to add a few new dtypes that are >>>>>>> >>>> >> ?NA-aware. E.g. an nafloat64 would reserve a particular NaN value >>>>>>> >>>> >> ?(there are lots of different NaN bit patterns, we'd just reserve >>>>>>> >>>> >> one) >>>>>>> >>>> >> ?that would represent NA. An naint32 would probably reserve the most >>>>>>> >>>> >> ?negative int32 value (like R does). Using the NA-aware dtypes >>>>>>> >>>> >> signals >>>>>>> >>>> >> ?that you are using NA values; there is no need for an additional >>>>>>> >>>> >> flag. >>>> >>> >>>>> >>> > >>>>> >>> > ?Definitely better names than r-int32. Going this way has the advantage >>>>> >>> > of >>>>> >>> > ?reducing the friction between R and numpy, and since R has pretty much >>>>> >>> > ?become the standard software for statistics that is an important >>>>> >>> > ?consideration. >>> >> >>> >> I would definitely steal their choices of NA value for naint32 and >>> >> nafloat64. I have reservations about their string NA value (i.e. 'NA') >>> >> as anyone doing business in North America and other continents may >>> >> have issues with that.... >> > >> > May be there is not so much need for reservation over the string NA, when >> > making the distinction between: >> > a- the internal representation of a "missing string" (what is stored in >> > memory, and that C-level code would need to be aware of) >> > b- the 'external' representation of a missing string (in Python, what would >> > be returned by repr() ) >> > c- what is assumed to be a missing string value when reading from a file. >> > >> > a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can >> > be configured as a module-level, class-level, or instance-level variable. > In R, a/ happens to be 'NA', unfortunately. :-/ In a sense yes, in a sense no. There is NA_STRING (that happens to store 'NA', but it could equally be 'foobar' or whatever) and there is "NA". NA_STRING is set once for all, and each time a string element in a vector is set to NA this points to that one. A string "NA" is not the NA_STRING. > I'm not really sure how they handle datasets that use valid 'NA' > values. Presumably, their input routines allow one to convert such > values to something else such that it can use 'NA'==NA internally. That's c/. Example in R's read.table(..., na.strings = "NA", ...). A number of R design choices (that are S design choices here) are empirical and based on experience. Datasets can originally be in a number of different flavours and conversion can be made when reading data into memory / R format. L. 
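On the c/ point, the corresponding knob already exists at the NumPy level in np.genfromtxt; roughly like this (an untested sketch, with a made-up two-line CSV):

import numpy as np
from io import StringIO

text = StringIO(u"1.5,NA,3.0\n4.0,5.5,NA\n")
arr = np.genfromtxt(text, delimiter=",",
                    missing_values="NA", usemask=True)
print(arr)         # the NA cells come back masked
print(arr.mask)

With usemask=False one instead gets filling_values (NaN by default for floats), which is the sentinel flavour of the same decision.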
> -- Robert Kern From emmanuelle.gouillart at normalesup.org Sat Jun 25 04:24:02 2011 From: emmanuelle.gouillart at normalesup.org (Emmanuelle Gouillart) Date: Sat, 25 Jun 2011 10:24:02 +0200 Subject: [Numpy-discussion] [ANN] Euroscipy 2011 - registration now open Message-ID: <20110625082402.GA6260@phare.normalesup.org> Dear all, After some delay due to technical problems, registration for Euroscipy 2011 is now open! Please go to http://www.euroscipy.org/conference/euroscipy2011, login to your account if you have one, or create a new account (right side of the upper banner of the Euroscipy webpage), then click on the registration panel on the sidebar on the left. Choose the items (tutorials and/or conference) that you wish to attend to, and proceed to pay for your items by credit card. Early-bird registration fees are as follows: * for academia, and self-employed persons: 50 euros for the tutorials (two days), and 50 euros for the conference (two days). * for corporate participants: 100 euros for tutorials (two days), and 100 euros for the conference (two days). For all days, lunch will be catered for and is included in the fee. Early-bird fees apply until July 24th; prices will be doubled for late registration. Book early to take advantage of the early-bird prices! As there is a limited number of seats in the lecture halls, registrations shall be accepted on a first-come first-serve basis. All questions regarding registration should be addressed exclusively to org-team at lists.euroscipy.org Looking forward to seeing you at Euroscipy! The organizers From matthew.brett at gmail.com Sat Jun 25 07:00:14 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 12:00:14 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 1:54 AM, Mark Wiebe wrote: > On Fri, Jun 24, 2011 at 5:21 PM, Matthew Brett ... >> @Mark - I don't have a clear idea whether you consider the nafloat64 >> option to be still in play as the first thing to be implemented >> (before array.mask). ? If it is, what kind of thing would persuade you >> either way? > > I'm focusing all of my effort on getting my proposal of adding a mask to the > core ndarray into a state where it satisfies everyone's requirements as best > I can. Maybe it would be worth setting out the requirements formally somewhere? > I'm not precluding the possibility that someone could convince me > that the na-dtype is good, but I gave it a good chunk of thought before > starting to write the proposal. To persuade me towards the na-dtype option, > I need to be convinced that I'm solving the problem class in a generic way > that works orthogonally with other features, with manageable implementation > requirements, a very usable result for both strong and weak programmers, and > with good performance characteristics. I think the na-dtype approach isn't > as generic as I would like, and the implementation seems like it would be > trickier than the masked approach. What I'm getting at, is that I think you have made the decision between these two implementations some time ago while looking at the C code. Now of course you would be a much better person to make that decision than - say - me. It's just that, if you want coherent feedback from us on this decision, we need to get some technical grasp of why you made it. 
I realize that it will not be easy to explain in detail, but honestly, it could be a useful discussion to have from your and our point of view, even if it ends up in the same place. See you, Matthew From matthew.brett at gmail.com Sat Jun 25 07:17:53 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 12:17:53 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 2:10 AM, Mark Wiebe wrote: > On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett > wrote: >> >> Hi, >> >> On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney >> wrote: >> ... >> > Perhaps we should make a wiki page someplace summarizing pros and cons >> > of the various implementation approaches? >> >> But - we should do this if it really is an open question which one we >> go for. ? If not then, we're just slowing Mark down in getting to the >> implementation. >> >> Assuming the question is still open, here's a starter for the pros and >> cons: >> >> array.mask >> 1) It's easier / neater to implement > > Yes > >> >> 2) It can generalize across dtypes > > Yes > >> >> 3) You can still get the masked data underneath the mask (allowing you >> to unmask etc) > > By setting up views appropriately, yes. If you don't have another view to > the underlying data, you can't get at it. >> >> nafloat64: >> 1) No memory overhead > > Yes > >> >> 2) Battle-tested implementation already done in R > > We can't really use that though, ?R is GPL and NumPy is BSD. The low-level > implementation details are likely different enough that a re-implementation > would be needed anyway. Right - I wasn't suggesting using the code, only that the idea can be made to work coherently with an API that seems to have won friends over time. >> I guess we'd have to test directly whether the non-continuous memory >> of the mask and data would cause enough cache-miss problems to >> outweigh the potential cycle-savings from single byte comparisons in >> array.mask. > > The different memory buffers are each contiguous, so the access patterns > still have a lot of coherency. I intend to give the mask memory layouts > matching those of the arrays. >> >> I guess that one and only one of these will get written. ?I guess that >> one of these choices may be a lot more satisfying to the current and >> future masked array itch than the other. > > I'm only going to implement one solution, yes. >> >> I'm personally worried that the memory overhead of array.masks will >> make many of us tend to avoid them. ?I work with images that can >> easily get large enough that I would not want an array-items size byte >> array added to my storage. > > May I ask what kind of dtypes and sizes you're working with? dtypes for images usually end up as floats - float32 or float64. On disk, and when memory mapped, they are often int16 or uint16. Sizes vary from fairly small 3D images of say 64 x 64 x 32 (1M in float64) to rather large 4D images - say 256 x 256 x 50 x 500 at the very high end (12.5G in float64). >> The reason I'm asking for more details about the implementation is >> because that is most of the argument for array.mask at the moment (1 >> and 2 above). > > I'm first trying to nail down more of the higher level requirements before > digging really deep into the implementation details. They greatly affect how > those details have to turn out. 
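For concreteness, the overhead being discussed for that largest case is simple arithmetic (assuming one mask byte per element, which is what the discussion above assumes):

n = 256 * 256 * 50 * 500        # the big 4D image mentioned above
print(n * 8 / 2.0 ** 30)        # float64 payload: about 12.2 GiB
print(n * 1 / 2.0 ** 30)        # one-byte mask: about 1.5 GiB extra, i.e. 12.5%
# a bit-packed mask (1 bit per element, e.g. via np.packbits) would be ~0.19 GiB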
Once you've started with the array.mask framework, you've committed yourself to the memory hit, and you may lose potential users who often hit memory limits. My guess is that no-one currently using np.ma is in that category, because it also uses a separate mask array, as I understand it. See you, Matthew From pgmdevlist at gmail.com Sat Jun 25 07:29:26 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Sat, 25 Jun 2011 13:29:26 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: This thread is getting quite long, innit ? And I think it's getting a tad confusing, because we're mixing two different concepts: missing values and masks. There should be support for missing values in numpy.core, I think we all agree on that. * What's been suggested of adding new dtypes (nafloat, naint) is great, by why not making it the default, then ? * Operations involving a NA (whatever the NA actually is, depending on the dtype of the input) should result in a NA (whatever the NA defined by the outputs dtype). That could be done by overloading the existing ufuncs to support the new dtypes. * There should be some simple methods to retrieve the location of those NAs in an array. Whether we just output the indices or a full boolean array (w/ True for a NA, False for a non-NA or vice-versa) needs to be decided. * We can always re-implement masked arrays to use these NAs in a way which would be consistent with numpy.ma (so as not to confuse existing users of numpy.ma): a mask would be a boolean array with the same shape than the underlying ndarray, with True for NA. Mark, I'd suggest you modify your proposal, making it clearer that it's not to add all of numpy.ma functionalities in the core, but just support these missing values. Using the term 'mask' should be avoided as much as possible, use a 'missing data' or whatever. From gael.varoquaux at normalesup.org Sat Jun 25 08:00:44 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 25 Jun 2011 14:00:44 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: <20110625120044.GA26847@phare.normalesup.org> On Sat, Jun 25, 2011 at 01:02:07AM +0100, Matthew Brett wrote: > I'm personally worried that the memory overhead of array.masks will > make many of us tend to avoid them. I work with images that can > easily get large enough that I would not want an array-items size byte > array added to my storage. I work with the same kind of data ( :D ). The way we manipulate our data, in my lab, is, to represent 3D data with a mask to use a 1D array and a corresponding 3D mask. It reduces the memory footprint, to the cost of loosing the 3D shape of the data. I am raising this because it is an option that has not been suggested so far. I am not saying that it should be the option used to implement mask arrays, I am just saying that there are many different ways of doing it, and I don't think that there is a one-size-fits-all solution. I tend to feel like Wes: good building block to easily implement our own solutions is what we want. 
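A tiny sketch of that packing scheme, storing only the valid voxels as a 1D array next to a full-resolution boolean mask (made-up volume and support mask):

import numpy as np

volume = np.random.rand(64, 64, 32)
support = volume > 0.2             # stand-in for a brain/support mask

packed = volume[support]           # 1D array holding only the voxels inside the mask
# ... work on packed ...
restored = np.zeros(volume.shape)
restored[support] = packed         # scatter back into the 3D shape when needed

The mask costs one byte (or one bit, if packed) per voxel, while the data cost drops to eight bytes per retained voxel only, which is where the memory saving comes from.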
My 2 cents, Gael From wesmckinn at gmail.com Sat Jun 25 10:14:47 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 25 Jun 2011 10:14:47 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris wrote: > > > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney wrote: >> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith wrote: >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root wrote: >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: >> >>> This is a situation where I would just... use an array and a mask, >> >>> rather than a masked array. Then lots of things -- changing fill >> >>> values, temporarily masking/unmasking things, etc. -- come from free, >> >>> just from knowing how arrays and boolean indexing work? >> >> >> >> With a masked array, it is "for free".? Why re-invent the wheel?? It >> >> has >> >> already been done for me. >> > >> > But it's not for free at all. It's an additional concept that has to >> > be maintained, documented, and learned (with the last cost, which is >> > multiplied by the number of users, being by far the greatest). It's >> > not reinventing the wheel, it's saying hey, I have wheels and axles, >> > but what I really need the library to provide is a wheel+axle >> > assembly! >> >> You're communicating my argument better than I am. >> >> >>> Do we really get much advantage by building all these complex >> >>> operations in? I worry that we're trying to anticipate and write code >> >>> for every situation that users find themselves in, instead of just >> >>> giving them some simple, orthogonal tools. >> >>> >> >> >> >> This is the danger, and which is why I advocate retaining the >> >> MaskedArray >> >> type that would provide the high-level "intelligent" operations, >> >> meanwhile >> >> having in the core the basic data structures for? pairing a mask with >> >> an >> >> array, and to recognize a special np.NA value that would act upon the >> >> mask >> >> rather than the underlying data.? Users would get very basic >> >> functionality, >> >> while the MaskedArray would continue to provide the interface that we >> >> are >> >> used to. >> > >> > The interface as described is quite different... in particular, all >> > aggregate operations would change their behavior. >> > >> >>> As a corollary, I worry that learning and keeping track of how masked >> >>> arrays work is more hassle than just ignoring them and writing the >> >>> necessary code by hand as needed. Certainly I can imagine that *if the >> >>> mask is a property of the data* then it's useful to have tools to keep >> >>> it aligned with the data through indexing and such. But some of these >> >>> other things are quicker to reimplement than to look up the docs for, >> >>> and the reimplementation is easier to read, at least for me... >> >> >> >> What you are advocating is similar to the "tried-n-true" coding >> >> practice of >> >> Matlab users of using NaNs.? You will hear from Matlab programmers >> >> about how >> >> it is the greatest idea since sliced bread (and I was one of them). >> >> Then I >> >> was introduced to Numpy, and I while I do sometimes still do the NaN >> >> approach, I realized that the masked array is a "better" way. >> > >> > Hey, no need to go around calling people Matlab programmers, you might >> > hurt someone's feelings. 
>> > >> > But seriously, my argument is that every abstraction and new concept >> > has a cost, and I'm dubious that the full masked array abstraction >> > carries its weight and justifies this cost, because it's highly >> > redundant with existing abstractions. That has nothing to do with how >> > tried-and-true anything is. >> >> +1. I think I will personally only be happy if "masked array" can be >> implemented while incurring near-zero cost from the end user >> perspective. If what we end up with is a faster implementation of >> numpy.ma in C I'm probably going to keep on using NaN... That's why >> I'm entirely insistent that whatever design be dogfooded on non-expert >> users. If it's very much harder / trickier / nuanced than R, you will >> have failed. >> > > This sounds unduly pessimistic to me. It's one thing to suggest different > approaches, another to cry doom and threaten to go eat worms. And all before > the code is written, benchmarks run, or trial made of the usefulness of the > approach. Let us see how things look as they get worked out. Mark has a good > track record for innovative tools and I'm rather curious myself to see what > the result is. > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > I hope you're right. So far it seems that anyone who has spent real time with R (e.g. myself, Nathaniel) has expressed serious concerns about the masked approach. And we got into this discussion at the Data Array summit in Austin last month because we're trying to make Python more competitive with R viz statistical and financial applications. I'm just trying to be (R)ealistic =P Remember that I very earnestly am doing everything I can these days to make scientific Python more successful in finance and statistics. One big difference with R's approach is that we care more about performance the the R community does. So maybe having special NA values will be prohibitive for that reason. Mark indeed has a fantastic track record and I've been extremely impressed with his NumPy work, so I've no doubt he'll do a good job. I just hope that you don't push aside my input-- my opinions are formed entirely based on my domain experience. -W From charlesr.harris at gmail.com Sat Jun 25 10:21:36 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 08:21:36 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM wrote: > This thread is getting quite long, innit ? > And I think it's getting a tad confusing, because we're mixing two > different concepts: missing values and masks. > There should be support for missing values in numpy.core, I think we all > agree on that. > * What's been suggested of adding new dtypes (nafloat, naint) is great, by > why not making it the default, then ? > * Operations involving a NA (whatever the NA actually is, depending on the > dtype of the input) should result in a NA (whatever the NA defined by the > outputs dtype). That could be done by overloading the existing ufuncs to > support the new dtypes. > * There should be some simple methods to retrieve the location of those NAs > in an array. Whether we just output the indices or a full boolean array (w/ > True for a NA, False for a non-NA or vice-versa) needs to be decided. 
> * We can always re-implement masked arrays to use these NAs in a way which > would be consistent with numpy.ma (so as not to confuse existing users of > numpy.ma): a mask would be a boolean array with the same shape than the > underlying ndarray, with True for NA. > Mark, I'd suggest you modify your proposal, making it clearer that it's not > to add all of numpy.ma functionalities in the core, but just support these > missing values. Using the term 'mask' should be avoided as much as possible, > use a 'missing data' or whatever. > I think he aims to support both. One complication with masks is keeping them tied to the data on disk. With na values one file can contain both the data and the missing data markers, whereas with masks, two files would be required. I don't think that will fly in the long run unless there is some standard file format, like geotiff for GIS, that combines both. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jun 25 10:25:33 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 08:25:33 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney wrote: > On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris > wrote: > > > > > > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney > wrote: > >> > >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith > wrote: > >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root > wrote: > >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith > wrote: > >> >>> This is a situation where I would just... use an array and a mask, > >> >>> rather than a masked array. Then lots of things -- changing fill > >> >>> values, temporarily masking/unmasking things, etc. -- come from > free, > >> >>> just from knowing how arrays and boolean indexing work? > >> >> > >> >> With a masked array, it is "for free". Why re-invent the wheel? It > >> >> has > >> >> already been done for me. > >> > > >> > But it's not for free at all. It's an additional concept that has to > >> > be maintained, documented, and learned (with the last cost, which is > >> > multiplied by the number of users, being by far the greatest). It's > >> > not reinventing the wheel, it's saying hey, I have wheels and axles, > >> > but what I really need the library to provide is a wheel+axle > >> > assembly! > >> > >> You're communicating my argument better than I am. > >> > >> >>> Do we really get much advantage by building all these complex > >> >>> operations in? I worry that we're trying to anticipate and write > code > >> >>> for every situation that users find themselves in, instead of just > >> >>> giving them some simple, orthogonal tools. > >> >>> > >> >> > >> >> This is the danger, and which is why I advocate retaining the > >> >> MaskedArray > >> >> type that would provide the high-level "intelligent" operations, > >> >> meanwhile > >> >> having in the core the basic data structures for pairing a mask with > >> >> an > >> >> array, and to recognize a special np.NA value that would act upon the > >> >> mask > >> >> rather than the underlying data. Users would get very basic > >> >> functionality, > >> >> while the MaskedArray would continue to provide the interface that we > >> >> are > >> >> used to. > >> > > >> > The interface as described is quite different... 
in particular, all > >> > aggregate operations would change their behavior. > >> > > >> >>> As a corollary, I worry that learning and keeping track of how > masked > >> >>> arrays work is more hassle than just ignoring them and writing the > >> >>> necessary code by hand as needed. Certainly I can imagine that *if > the > >> >>> mask is a property of the data* then it's useful to have tools to > keep > >> >>> it aligned with the data through indexing and such. But some of > these > >> >>> other things are quicker to reimplement than to look up the docs > for, > >> >>> and the reimplementation is easier to read, at least for me... > >> >> > >> >> What you are advocating is similar to the "tried-n-true" coding > >> >> practice of > >> >> Matlab users of using NaNs. You will hear from Matlab programmers > >> >> about how > >> >> it is the greatest idea since sliced bread (and I was one of them). > >> >> Then I > >> >> was introduced to Numpy, and I while I do sometimes still do the NaN > >> >> approach, I realized that the masked array is a "better" way. > >> > > >> > Hey, no need to go around calling people Matlab programmers, you might > >> > hurt someone's feelings. > >> > > >> > But seriously, my argument is that every abstraction and new concept > >> > has a cost, and I'm dubious that the full masked array abstraction > >> > carries its weight and justifies this cost, because it's highly > >> > redundant with existing abstractions. That has nothing to do with how > >> > tried-and-true anything is. > >> > >> +1. I think I will personally only be happy if "masked array" can be > >> implemented while incurring near-zero cost from the end user > >> perspective. If what we end up with is a faster implementation of > >> numpy.ma in C I'm probably going to keep on using NaN... That's why > >> I'm entirely insistent that whatever design be dogfooded on non-expert > >> users. If it's very much harder / trickier / nuanced than R, you will > >> have failed. > >> > > > > This sounds unduly pessimistic to me. It's one thing to suggest different > > approaches, another to cry doom and threaten to go eat worms. And all > before > > the code is written, benchmarks run, or trial made of the usefulness of > the > > approach. Let us see how things look as they get worked out. Mark has a > good > > track record for innovative tools and I'm rather curious myself to see > what > > the result is. > > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > I hope you're right. So far it seems that anyone who has spent real > time with R (e.g. myself, Nathaniel) has expressed serious concerns > about the masked approach. And we got into this discussion at the Data > Array summit in Austin last month because we're trying to make Python > more competitive with R viz statistical and financial applications. > I'm just trying to be (R)ealistic =P Remember that I very earnestly am > doing everything I can these days to make scientific Python more > successful in finance and statistics. One big difference with R's > approach is that we care more about performance the the R community > does. So maybe having special NA values will be prohibitive for that > reason. > > Mark indeed has a fantastic track record and I've been extremely > impressed with his NumPy work, so I've no doubt he'll do a good job. 
I > just hope that you don't push aside my input-- my opinions are formed > entirely based on my domain experience. > > I think what we really need to see are the use cases and work flow. The ones that hadn't occurred to me before were memory mapped files and data stored on disk in general. I think we may need some standard format for masked data on disk if we don't go the NA value route. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jun 25 10:27:57 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 08:27:57 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <20110625120044.GA26847@phare.normalesup.org> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <20110625120044.GA26847@phare.normalesup.org> Message-ID: On Sat, Jun 25, 2011 at 6:00 AM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Sat, Jun 25, 2011 at 01:02:07AM +0100, Matthew Brett wrote: > > I'm personally worried that the memory overhead of array.masks will > > make many of us tend to avoid them. I work with images that can > > easily get large enough that I would not want an array-items size byte > > array added to my storage. > > I work with the same kind of data ( :D ). > > The way we manipulate our data, in my lab, is, to represent 3D data with > a mask to use a 1D array and a corresponding 3D mask. It reduces the > memory footprint, to the cost of loosing the 3D shape of the data. > > I am raising this because it is an option that has not been suggested so > far. I am not saying that it should be the option used to implement mask > arrays, I am just saying that there are many different ways of doing it, > and I don't think that there is a one-size-fits-all solution. > > I tend to feel like Wes: good building block to easily implement our own > solutions is what we want. > > Could you expand a bit on what sort of data you have and how you deal with it. Where does it come from, how is it stored on disk, what do you do with it? That sort of thing. Chuck. -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Jun 25 10:31:05 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 15:31:05 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 3:21 PM, Charles R Harris wrote: > > > On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM wrote: >> >> This thread is getting quite long, innit ? >> And I think it's getting a tad confusing, because we're mixing two >> different concepts: missing values and masks. >> There should be support for missing values in numpy.core, I think we all >> agree on that. >> * What's been suggested of adding new dtypes (nafloat, naint) is great, by >> why not making it the default, then ? >> >> * Operations involving a NA (whatever the NA actually is, depending on the >> dtype of the input) should result in a NA (whatever the NA defined by the >> outputs dtype). That could be done by overloading the existing ufuncs to >> support the new dtypes. >> * There should be some simple methods to retrieve the location of those >> NAs in an array. Whether we just output the indices or a full boolean array >> (w/ True for a NA, False for a non-NA or vice-versa) needs to be decided. 
>> * We can always re-implement masked arrays to use these NAs in a way which >> would be consistent with numpy.ma (so as not to confuse existing users of >> numpy.ma): a mask would be a boolean array with the same shape than the >> underlying ndarray, with True for NA. >> Mark, I'd suggest you modify your proposal, making it clearer that it's >> not to add all of numpy.ma functionalities in the core, but just support >> these missing values. Using the term 'mask' should be avoided as much as >> possible, use a 'missing data' or whatever. > > I think he aims to support both. I don't think Mark is proposing to support both. He's proposing to implement only array.mask. See you, Matthew From matthew.brett at gmail.com Sat Jun 25 10:40:57 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 15:40:57 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 3:14 PM, Wes McKinney wrote: ... > I hope you're right. So far it seems that anyone who has spent real > time with R (e.g. myself, Nathaniel) has expressed serious concerns > about the masked approach. I'm sorry - I have been distracted. For my sake, and because this thread has been so long, would you mind setting out again the API differences that you think are important between array.mask and say na-dtype? > One big difference with R's > approach is that we care more about performance the the R community > does. So maybe having special NA values will be prohibitive for that > reason. Do you have reason to think the NA values will be slower? I was guessing they'd be faster, but it seems to me that there's so much involved that we would need two implementations to test. > Mark indeed has a fantastic track record and I've been extremely > impressed with his NumPy work, so I've no doubt he'll do a good job. I > just hope that you don't push aside my input-- my opinions are formed > entirely based on my domain experience. Yup. The thread is long because we all have good hope that whatever gets in and is agreed will be important and useful. See you, Matthew From wesmckinn at gmail.com Sat Jun 25 10:44:15 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 25 Jun 2011 10:44:15 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris wrote: > > > On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney wrote: >> >> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris >> wrote: >> > >> > >> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney >> > wrote: >> >> >> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith >> >> wrote: >> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root >> >> > wrote: >> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith >> >> >> wrote: >> >> >>> This is a situation where I would just... use an array and a mask, >> >> >>> rather than a masked array. Then lots of things -- changing fill >> >> >>> values, temporarily masking/unmasking things, etc. -- come from >> >> >>> free, >> >> >>> just from knowing how arrays and boolean indexing work? >> >> >> >> >> >> With a masked array, it is "for free".? Why re-invent the wheel?? It >> >> >> has >> >> >> already been done for me. >> >> > >> >> > But it's not for free at all. 
It's an additional concept that has to >> >> > be maintained, documented, and learned (with the last cost, which is >> >> > multiplied by the number of users, being by far the greatest). It's >> >> > not reinventing the wheel, it's saying hey, I have wheels and axles, >> >> > but what I really need the library to provide is a wheel+axle >> >> > assembly! >> >> >> >> You're communicating my argument better than I am. >> >> >> >> >>> Do we really get much advantage by building all these complex >> >> >>> operations in? I worry that we're trying to anticipate and write >> >> >>> code >> >> >>> for every situation that users find themselves in, instead of just >> >> >>> giving them some simple, orthogonal tools. >> >> >>> >> >> >> >> >> >> This is the danger, and which is why I advocate retaining the >> >> >> MaskedArray >> >> >> type that would provide the high-level "intelligent" operations, >> >> >> meanwhile >> >> >> having in the core the basic data structures for? pairing a mask >> >> >> with >> >> >> an >> >> >> array, and to recognize a special np.NA value that would act upon >> >> >> the >> >> >> mask >> >> >> rather than the underlying data.? Users would get very basic >> >> >> functionality, >> >> >> while the MaskedArray would continue to provide the interface that >> >> >> we >> >> >> are >> >> >> used to. >> >> > >> >> > The interface as described is quite different... in particular, all >> >> > aggregate operations would change their behavior. >> >> > >> >> >>> As a corollary, I worry that learning and keeping track of how >> >> >>> masked >> >> >>> arrays work is more hassle than just ignoring them and writing the >> >> >>> necessary code by hand as needed. Certainly I can imagine that *if >> >> >>> the >> >> >>> mask is a property of the data* then it's useful to have tools to >> >> >>> keep >> >> >>> it aligned with the data through indexing and such. But some of >> >> >>> these >> >> >>> other things are quicker to reimplement than to look up the docs >> >> >>> for, >> >> >>> and the reimplementation is easier to read, at least for me... >> >> >> >> >> >> What you are advocating is similar to the "tried-n-true" coding >> >> >> practice of >> >> >> Matlab users of using NaNs.? You will hear from Matlab programmers >> >> >> about how >> >> >> it is the greatest idea since sliced bread (and I was one of them). >> >> >> Then I >> >> >> was introduced to Numpy, and I while I do sometimes still do the NaN >> >> >> approach, I realized that the masked array is a "better" way. >> >> > >> >> > Hey, no need to go around calling people Matlab programmers, you >> >> > might >> >> > hurt someone's feelings. >> >> > >> >> > But seriously, my argument is that every abstraction and new concept >> >> > has a cost, and I'm dubious that the full masked array abstraction >> >> > carries its weight and justifies this cost, because it's highly >> >> > redundant with existing abstractions. That has nothing to do with how >> >> > tried-and-true anything is. >> >> >> >> +1. I think I will personally only be happy if "masked array" can be >> >> implemented while incurring near-zero cost from the end user >> >> perspective. If what we end up with is a faster implementation of >> >> numpy.ma in C I'm probably going to keep on using NaN... That's why >> >> I'm entirely insistent that whatever design be dogfooded on non-expert >> >> users. If it's very much harder / trickier / nuanced than R, you will >> >> have failed. >> >> >> > >> > This sounds unduly pessimistic to me. 
It's one thing to suggest >> > different >> > approaches, another to cry doom and threaten to go eat worms. And all >> > before >> > the code is written, benchmarks run, or trial made of the usefulness of >> > the >> > approach. Let us see how things look as they get worked out. Mark has a >> > good >> > track record for innovative tools and I'm rather curious myself to see >> > what >> > the result is. >> > >> > Chuck >> > >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> > >> >> I hope you're right. So far it seems that anyone who has spent real >> time with R (e.g. myself, Nathaniel) has expressed serious concerns >> about the masked approach. And we got into this discussion at the Data >> Array summit in Austin last month because we're trying to make Python >> more competitive with R viz statistical and financial applications. >> I'm just trying to be (R)ealistic =P Remember that I very earnestly am >> doing everything I can these days to make scientific Python more >> successful in finance and statistics. One big difference with R's >> approach is that we care more about performance the the R community >> does. So maybe having special NA values will be prohibitive for that >> reason. >> >> Mark indeed has a fantastic track record and I've been extremely >> impressed with his NumPy work, so I've no doubt he'll do a good job. I >> just hope that you don't push aside my input-- my opinions are formed >> entirely based on my domain experience. >> > > I think what we really need to see are the use cases and work flow. The ones > that hadn't occurred to me before were memory mapped files and data stored > on disk in general. I think we may need some standard format for masked data > on disk if we don't go the NA value route. > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > Here are some things I can think of that would be affected by any changes here 1) Right now users of pandas can type pandas.isnull(series[5]) and that will yield True if the value is NA for any dtype. This might be hard to support in the masked regime 2) Functions like {Series, DataFrame}.fillna would hopefully look just like this: # value is 0 or some other value to fill new_series = self.copy() new_series[isnull(new_series)] = value Keep in mind that people will write custom NA handling logic. So they might do: series[isnull(other_series) & isnull(other_series2)] = val 3) Nulling / NA-ing out data is very common # null out this data up to and including date1 in these three columns frame.ix[:date1, [col1, col2, col3]] = NaN # But this should work fine too frame.ix[:date1, [col1, col2, col3]] = 0 I'll try to think of some others. The main thing is that the NA value is very easy to think about and fits in naturally with how people (at least statistical / financial users) think about and work with data. 
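For concreteness, here is a minimal runnable sketch of the two idioms being contrasted, using NaN and numpy.ma as stand-ins (np.NA is only a proposal at this point, so these are today's APIs, not the proposed one):

    import numpy as np

    # Value-based idiom (what pandas does today, with float NaN as the NA):
    series = np.array([1.0, np.nan, 3.0])
    filled = series.copy()
    filled[np.isnan(filled)] = 0.0        # fillna-style: "set these values to 0"

    # Mask-based idiom with numpy.ma: the mask is a second object to reason about.
    mseries = np.ma.masked_invalid([1.0, np.nan, 3.0])
    mseries.mask[0] = True                # "set these mask locations to True"
    mfilled = mseries.filled(0.0)         # back to a plain ndarray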
If you have to say "I have to set these mask locations to True" it introduces additional mental effort compared with "I'll just set these values to NA" From charlesr.harris at gmail.com Sat Jun 25 10:46:18 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 08:46:18 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 8:31 AM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 3:21 PM, Charles R Harris > wrote: > > > > > > On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM wrote: > >> > >> This thread is getting quite long, innit ? > >> And I think it's getting a tad confusing, because we're mixing two > >> different concepts: missing values and masks. > >> There should be support for missing values in numpy.core, I think we all > >> agree on that. > >> * What's been suggested of adding new dtypes (nafloat, naint) is great, > by > >> why not making it the default, then ? > >> > >> * Operations involving a NA (whatever the NA actually is, depending on > the > >> dtype of the input) should result in a NA (whatever the NA defined by > the > >> outputs dtype). That could be done by overloading the existing ufuncs to > >> support the new dtypes. > >> * There should be some simple methods to retrieve the location of those > >> NAs in an array. Whether we just output the indices or a full boolean > array > >> (w/ True for a NA, False for a non-NA or vice-versa) needs to be > decided. > >> * We can always re-implement masked arrays to use these NAs in a way > which > >> would be consistent with numpy.ma (so as not to confuse existing users > of > >> numpy.ma): a mask would be a boolean array with the same shape than the > >> underlying ndarray, with True for NA. > >> Mark, I'd suggest you modify your proposal, making it clearer that it's > >> not to add all of numpy.ma functionalities in the core, but just > support > >> these missing values. Using the term 'mask' should be avoided as much as > >> possible, use a 'missing data' or whatever. > > > > I think he aims to support both. > > I don't think Mark is proposing to support both. He's proposing to > implement only array.mask. > > I think you are confusing function with implementation. If you look at the current NEP, it does NA but does so by using masks behind the scene in a transparent manner. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Jun 25 10:52:00 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 15:52:00 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 3:46 PM, Charles R Harris wrote: > > > On Sat, Jun 25, 2011 at 8:31 AM, Matthew Brett > wrote: >> >> Hi, >> >> On Sat, Jun 25, 2011 at 3:21 PM, Charles R Harris >> wrote: >> > >> > >> > On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM wrote: >> >> >> >> This thread is getting quite long, innit ? >> >> And I think it's getting a tad confusing, because we're mixing two >> >> different concepts: missing values and masks. >> >> There should be support for missing values in numpy.core, I think we >> >> all >> >> agree on that. >> >> * What's been suggested of adding new dtypes (nafloat, naint) is great, >> >> by >> >> why not making it the default, then ? 
>> >>
>> >> * Operations involving a NA (whatever the NA actually is, depending on the
>> >> dtype of the input) should result in a NA (whatever the NA defined by the
>> >> outputs dtype). That could be done by overloading the existing ufuncs to
>> >> support the new dtypes.
>> >> * There should be some simple methods to retrieve the location of those
>> >> NAs in an array. Whether we just output the indices or a full boolean
>> >> array (w/ True for a NA, False for a non-NA or vice-versa) needs to be
>> >> decided.
>> >> * We can always re-implement masked arrays to use these NAs in a way
>> >> which would be consistent with numpy.ma (so as not to confuse existing
>> >> users of numpy.ma): a mask would be a boolean array with the same shape
>> >> than the underlying ndarray, with True for NA.
>> >> Mark, I'd suggest you modify your proposal, making it clearer that it's
>> >> not to add all of numpy.ma functionalities in the core, but just support
>> >> these missing values. Using the term 'mask' should be avoided as much as
>> >> possible, use a 'missing data' or whatever.
>> >
>> > I think he aims to support both.
>>
>> I don't think Mark is proposing to support both. He's proposing to
>> implement only array.mask.
>>
>
> I think you are confusing function with implementation. If you look at the
> current NEP, it does NA but does so by using masks behind the scene in a
> transparent manner.

Yes, there is some confusion; just to be clear, I'm pointing out that
Mark is not proposing to implement na-dtypes, and is proposing to
implement array.mask.

See you,

Matthew

From gael.varoquaux at normalesup.org  Sat Jun 25 10:53:14 2011
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 25 Jun 2011 16:53:14 +0200
Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
In-Reply-To: 
References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <20110625120044.GA26847@phare.normalesup.org>
Message-ID: <20110625145314.GB6769@phare.normalesup.org>

On Sat, Jun 25, 2011 at 08:27:57AM -0600, Charles R Harris wrote:
> Could you expand a bit on what sort of data you have and how you deal with
> it. Where does it come from, how is it stored on disk, what do you do with
> it? That sort of thing.

3D and 4D images. Mostly stored on disk in domain-specific files (nifti
files), although I sometimes store the arrays that I use in numpy files. I
do a lot of statistical analysis on it, mostly multivariate (machine
learning stuff, ICA, and what not).

Masks and missing values can mean two things: the mask is often used to
designate which part of the image is interesting (i.e. the part of the
image that is in the brain). Sometimes I also have missing values, because
in a time series some data has been corrupted. I don't think that the same
implementation of masking/missing values will fit both needs.
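A small sketch of those two distinct needs, using only current numpy (the shapes are made up; real images are far larger and carry metadata such as scale factors):

    import numpy as np

    # Hypothetical 4D image: (x, y, z, time).
    data = np.random.rand(4, 4, 4, 10)

    # Need 1: a region-of-interest mask.  It is a property of the analysis,
    # not of the data; keeping only the in-mask voxels saves memory at the
    # cost of losing the 3D shape.
    roi = np.zeros((4, 4, 4), dtype=bool)
    roi[1:3, 1:3, 1:3] = True
    in_roi = data[roi]                    # shape (n_voxels, n_timepoints)

    # Need 2: genuinely missing values, e.g. corrupted time points.
    corrupted = np.zeros(10, dtype=bool)
    corrupted[[2, 7]] = True
    mean_signal = in_roi[:, ~corrupted].mean(axis=1)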
G From charlesr.harris at gmail.com Sat Jun 25 10:55:23 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 08:55:23 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 8:44 AM, Wes McKinney wrote: > On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris > wrote: > > > > > > On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney > wrote: > >> > >> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney > >> > wrote: > >> >> > >> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith > >> >> wrote: > >> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root > >> >> > wrote: > >> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith > >> >> >> wrote: > >> >> >>> This is a situation where I would just... use an array and a > mask, > >> >> >>> rather than a masked array. Then lots of things -- changing fill > >> >> >>> values, temporarily masking/unmasking things, etc. -- come from > >> >> >>> free, > >> >> >>> just from knowing how arrays and boolean indexing work? > >> >> >> > >> >> >> With a masked array, it is "for free". Why re-invent the wheel? > It > >> >> >> has > >> >> >> already been done for me. > >> >> > > >> >> > But it's not for free at all. It's an additional concept that has > to > >> >> > be maintained, documented, and learned (with the last cost, which > is > >> >> > multiplied by the number of users, being by far the greatest). It's > >> >> > not reinventing the wheel, it's saying hey, I have wheels and > axles, > >> >> > but what I really need the library to provide is a wheel+axle > >> >> > assembly! > >> >> > >> >> You're communicating my argument better than I am. > >> >> > >> >> >>> Do we really get much advantage by building all these complex > >> >> >>> operations in? I worry that we're trying to anticipate and write > >> >> >>> code > >> >> >>> for every situation that users find themselves in, instead of > just > >> >> >>> giving them some simple, orthogonal tools. > >> >> >>> > >> >> >> > >> >> >> This is the danger, and which is why I advocate retaining the > >> >> >> MaskedArray > >> >> >> type that would provide the high-level "intelligent" operations, > >> >> >> meanwhile > >> >> >> having in the core the basic data structures for pairing a mask > >> >> >> with > >> >> >> an > >> >> >> array, and to recognize a special np.NA value that would act upon > >> >> >> the > >> >> >> mask > >> >> >> rather than the underlying data. Users would get very basic > >> >> >> functionality, > >> >> >> while the MaskedArray would continue to provide the interface that > >> >> >> we > >> >> >> are > >> >> >> used to. > >> >> > > >> >> > The interface as described is quite different... in particular, all > >> >> > aggregate operations would change their behavior. > >> >> > > >> >> >>> As a corollary, I worry that learning and keeping track of how > >> >> >>> masked > >> >> >>> arrays work is more hassle than just ignoring them and writing > the > >> >> >>> necessary code by hand as needed. Certainly I can imagine that > *if > >> >> >>> the > >> >> >>> mask is a property of the data* then it's useful to have tools to > >> >> >>> keep > >> >> >>> it aligned with the data through indexing and such. 
But some of > >> >> >>> these > >> >> >>> other things are quicker to reimplement than to look up the docs > >> >> >>> for, > >> >> >>> and the reimplementation is easier to read, at least for me... > >> >> >> > >> >> >> What you are advocating is similar to the "tried-n-true" coding > >> >> >> practice of > >> >> >> Matlab users of using NaNs. You will hear from Matlab programmers > >> >> >> about how > >> >> >> it is the greatest idea since sliced bread (and I was one of > them). > >> >> >> Then I > >> >> >> was introduced to Numpy, and I while I do sometimes still do the > NaN > >> >> >> approach, I realized that the masked array is a "better" way. > >> >> > > >> >> > Hey, no need to go around calling people Matlab programmers, you > >> >> > might > >> >> > hurt someone's feelings. > >> >> > > >> >> > But seriously, my argument is that every abstraction and new > concept > >> >> > has a cost, and I'm dubious that the full masked array abstraction > >> >> > carries its weight and justifies this cost, because it's highly > >> >> > redundant with existing abstractions. That has nothing to do with > how > >> >> > tried-and-true anything is. > >> >> > >> >> +1. I think I will personally only be happy if "masked array" can be > >> >> implemented while incurring near-zero cost from the end user > >> >> perspective. If what we end up with is a faster implementation of > >> >> numpy.ma in C I'm probably going to keep on using NaN... That's why > >> >> I'm entirely insistent that whatever design be dogfooded on > non-expert > >> >> users. If it's very much harder / trickier / nuanced than R, you will > >> >> have failed. > >> >> > >> > > >> > This sounds unduly pessimistic to me. It's one thing to suggest > >> > different > >> > approaches, another to cry doom and threaten to go eat worms. And all > >> > before > >> > the code is written, benchmarks run, or trial made of the usefulness > of > >> > the > >> > approach. Let us see how things look as they get worked out. Mark has > a > >> > good > >> > track record for innovative tools and I'm rather curious myself to see > >> > what > >> > the result is. > >> > > >> > Chuck > >> > > >> > > >> > _______________________________________________ > >> > NumPy-Discussion mailing list > >> > NumPy-Discussion at scipy.org > >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > > >> > > >> > >> I hope you're right. So far it seems that anyone who has spent real > >> time with R (e.g. myself, Nathaniel) has expressed serious concerns > >> about the masked approach. And we got into this discussion at the Data > >> Array summit in Austin last month because we're trying to make Python > >> more competitive with R viz statistical and financial applications. > >> I'm just trying to be (R)ealistic =P Remember that I very earnestly am > >> doing everything I can these days to make scientific Python more > >> successful in finance and statistics. One big difference with R's > >> approach is that we care more about performance the the R community > >> does. So maybe having special NA values will be prohibitive for that > >> reason. > >> > >> Mark indeed has a fantastic track record and I've been extremely > >> impressed with his NumPy work, so I've no doubt he'll do a good job. I > >> just hope that you don't push aside my input-- my opinions are formed > >> entirely based on my domain experience. > >> > > > > I think what we really need to see are the use cases and work flow. 
The > ones > > that hadn't occurred to me before were memory mapped files and data > stored > > on disk in general. I think we may need some standard format for masked > data > > on disk if we don't go the NA value route. > > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > Here are some things I can think of that would be affected by any changes > here > > 1) Right now users of pandas can type pandas.isnull(series[5]) and > that will yield True if the value is NA for any dtype. This might be > hard to support in the masked regime > 2) Functions like {Series, DataFrame}.fillna would hopefully look just > like this: > > # value is 0 or some other value to fill > new_series = self.copy() > new_series[isnull(new_series)] = value > > Keep in mind that people will write custom NA handling logic. So they might > do: > > series[isnull(other_series) & isnull(other_series2)] = val > > 3) Nulling / NA-ing out data is very common > > # null out this data up to and including date1 in these three columns > frame.ix[:date1, [col1, col2, col3]] = NaN > > # But this should work fine too > frame.ix[:date1, [col1, col2, col3]] = 0 > > I'll try to think of some others. The main thing is that the NA value > is very easy to think about and fits in naturally with how people (at > least statistical / financial users) think about and work with data. > If you have to say "I have to set these mask locations to True" it > introduces additional mental effort compared with "I'll just set these > values to NA" > _ > You should take a look at the current NEP . You don't have to deal with the mask, you just need to assign np.NA to the array location, like in R (as I understand it). The masks are for the most part transparent to the user. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jun 25 11:05:46 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 09:05:46 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 8:52 AM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 3:46 PM, Charles R Harris > wrote: > > > > > > On Sat, Jun 25, 2011 at 8:31 AM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Sat, Jun 25, 2011 at 3:21 PM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM > wrote: > >> >> > >> >> This thread is getting quite long, innit ? > >> >> And I think it's getting a tad confusing, because we're mixing two > >> >> different concepts: missing values and masks. > >> >> There should be support for missing values in numpy.core, I think we > >> >> all > >> >> agree on that. > >> >> * What's been suggested of adding new dtypes (nafloat, naint) is > great, > >> >> by > >> >> why not making it the default, then ? > >> >> > >> >> * Operations involving a NA (whatever the NA actually is, depending > on > >> >> the > >> >> dtype of the input) should result in a NA (whatever the NA defined by > >> >> the > >> >> outputs dtype). That could be done by overloading the existing ufuncs > >> >> to > >> >> support the new dtypes. > >> >> * There should be some simple methods to retrieve the location of > those > >> >> NAs in an array. 
Whether we just output the indices or a full boolean > >> >> array > >> >> (w/ True for a NA, False for a non-NA or vice-versa) needs to be > >> >> decided. > >> >> * We can always re-implement masked arrays to use these NAs in a way > >> >> which > >> >> would be consistent with numpy.ma (so as not to confuse existing > users > >> >> of > >> >> numpy.ma): a mask would be a boolean array with the same shape than > the > >> >> underlying ndarray, with True for NA. > >> >> Mark, I'd suggest you modify your proposal, making it clearer that > it's > >> >> not to add all of numpy.ma functionalities in the core, but just > >> >> support > >> >> these missing values. Using the term 'mask' should be avoided as much > >> >> as > >> >> possible, use a 'missing data' or whatever. > >> > > >> > I think he aims to support both. > >> > >> I don't think Mark is proposing to support both. He's proposing to > >> implement only array.mask. > >> > > > > I think you are confusing function with implementation. If you look at > the > > current NEP, it does NA but does so by using masks behind the scene in a > > transparent manner. > > Yes, there is some confusion; just to be clear, I'm pointing out that > Mark is not proposing to implement na-dtypes, and is proposing to > implement array.mask. > > I would say he is proposing to implement na-dtypes by means of masks. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Jun 25 11:08:27 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 16:08:27 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 3:44 PM, Wes McKinney wrote: ... > Here are some things I can think of that would be affected by any changes here > > 1) Right now users of pandas can type pandas.isnull(series[5]) and > that will yield True if the value is NA for any dtype. This might be > hard to support in the masked regime But, following the NEP, I could imagine something like this: def isnull(a): if a.validitymask is None: return np.ones(a.shape, dtype=np.bool) return a.validitymask == False I suppose the return array in this case would be 0d bool. Would that not serve here? > 2) Functions like {Series, DataFrame}.fillna would hopefully look just > like this: > > # value is 0 or some other value to fill > new_series = self.copy() > new_series[isnull(new_series)] = value isnull above or: new_series = new_series.fill_masked(value) ? > Keep in mind that people will write custom NA handling logic. So they might do: > > series[isnull(other_series) & isnull(other_series2)] = val > 3) Nulling / NA-ing out data is very common > > # null out this data up to and including date1 in these three columns > frame.ix[:date1, [col1, col2, col3]] = NaN I think Mark is proposing that this: frame.ix[:date1, [col1, col2, col3]] = np.NA will work - maybe he can correct me if I'm wrong? > I'll try to think of some others. The main thing is that the NA value > is very easy to think about and fits in naturally with how people (at > least statistical / financial users) think about and work with data. 
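As an aside, the isnull sketch a few paragraphs up can be written today against numpy.ma instead of the hypothetical validitymask attribute (note that the original sketch returns np.ones when there is no mask, which presumably should be all-False, since nothing is null in an unmasked array):

    import numpy as np

    def isnull(a):
        # numpy.ma analogue of the sketch above: the mask plays the role of
        # (the negation of) the hypothetical validitymask.
        mask = np.ma.getmaskarray(a)      # all-False array for plain ndarrays
        return mask.copy()

    print(isnull(np.ma.array([1.0, 2.0], mask=[False, True])))  # [False  True]
    print(isnull(np.array([1.0, 2.0])))                         # [False False]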
> If you have to say "I have to set these mask locations to True" it > introduces additional mental effort compared with "I'll just set these > values to NA" I could imagine making the API such that, in practice, you would be thinking that you were setting the values to NA, even though you were in fact setting a mask. My own worry here is not about the API, but the implementation. I'm worried that it is using more memory, and I don't know how we can be sure whether it will be faster without implementing both. See you, Matthew From matthew.brett at gmail.com Sat Jun 25 11:10:40 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 16:10:40 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: Hi, On Sat, Jun 25, 2011 at 4:05 PM, Charles R Harris wrote: > > > On Sat, Jun 25, 2011 at 8:52 AM, Matthew Brett > wrote: >> >> Hi, >> >> On Sat, Jun 25, 2011 at 3:46 PM, Charles R Harris >> wrote: >> > >> > >> > On Sat, Jun 25, 2011 at 8:31 AM, Matthew Brett >> > wrote: >> >> >> >> Hi, >> >> >> >> On Sat, Jun 25, 2011 at 3:21 PM, Charles R Harris >> >> wrote: >> >> > >> >> > >> >> > On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM >> >> > wrote: >> >> >> >> >> >> This thread is getting quite long, innit ? >> >> >> And I think it's getting a tad confusing, because we're mixing two >> >> >> different concepts: missing values and masks. >> >> >> There should be support for missing values in numpy.core, I think we >> >> >> all >> >> >> agree on that. >> >> >> * What's been suggested of adding new dtypes (nafloat, naint) is >> >> >> great, >> >> >> by >> >> >> why not making it the default, then ? >> >> >> >> >> >> * Operations involving a NA (whatever the NA actually is, depending >> >> >> on >> >> >> the >> >> >> dtype of the input) should result in a NA (whatever the NA defined >> >> >> by >> >> >> the >> >> >> outputs dtype). That could be done by overloading the existing >> >> >> ufuncs >> >> >> to >> >> >> support the new dtypes. >> >> >> * There should be some simple methods to retrieve the location of >> >> >> those >> >> >> NAs in an array. Whether we just output the indices or a full >> >> >> boolean >> >> >> array >> >> >> (w/ True for a NA, False for a non-NA or vice-versa) needs to be >> >> >> decided. >> >> >> * We can always re-implement masked arrays to use these NAs in a way >> >> >> which >> >> >> would be consistent with numpy.ma (so as not to confuse existing >> >> >> users >> >> >> of >> >> >> numpy.ma): a mask would be a boolean array with the same shape than >> >> >> the >> >> >> underlying ndarray, with True for NA. >> >> >> Mark, I'd suggest you modify your proposal, making it clearer that >> >> >> it's >> >> >> not to add all of numpy.ma functionalities in the core, but just >> >> >> support >> >> >> these missing values. Using the term 'mask' should be avoided as >> >> >> much >> >> >> as >> >> >> possible, use a 'missing data' or whatever. >> >> > >> >> > I think he aims to support both. >> >> >> >> I don't think Mark is proposing to support both. ?He's proposing to >> >> implement only array.mask. >> >> >> > >> > I think you are confusing function with implementation. If you look at >> > the >> > current NEP, it does NA but does so by using masks behind the scene in a >> > transparent manner. 
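To make the "transparent" point concrete: the lines marked hypothetical below follow my reading of the draft NEP and do not run in any released numpy; the numpy.ma lines are the closest thing that runs today.

    import numpy as np

    # Hypothetical, per the draft NEP (np.NA and maskna do not exist yet):
    #   a = np.array([1.0, 2.0, 3.0], maskna=True)
    #   a[1] = np.NA      # assign NA; the mask is updated behind the scenes
    #   a.sum()           # -> NA, unless asked to skip NAs
    #
    # Closest runnable approximation today, where the mask is explicit:
    a = np.ma.array([1.0, 2.0, 3.0])
    a[1] = np.ma.masked
    print(a.sum())        # 4.0 -- numpy.ma skips rather than propagates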
>> >> Yes, there is some confusion; just to be clear, I'm pointing out that >> Mark is not proposing to implement na-dtypes, and is proposing to >> implement array.mask. >> > > I would say he is proposing to implement na-dtypes by means of masks. But that would be confusing function with implementation. array.mask is the implementation with an extra mask array. na-dtypes is the implementation with values in the array set to special missing values. See you, Matthew From shish at keba.be Sat Jun 25 11:12:49 2011 From: shish at keba.be (Olivier Delalleau) Date: Sat, 25 Jun 2011 11:12:49 -0400 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: 2011/6/25 Charles R Harris > I think what we really need to see are the use cases and work flow. The > ones that hadn't occurred to me before were memory mapped files and data > stored on disk in general. I think we may need some standard format for > masked data on disk if we don't go the NA value route. > > Chuck > Sorry I can't follow closely this thread, but since use cases are mentioned, I'll throw one. It may or may not already be supported by the current NEP, I don't know. It may not be possible to support it either... anyway: I typically use NaN to denote missing values in an array. The main motivation vs. masks is my code typically involves multiple systems working together, and not all of them are necessarily based on numpy. The nice thing with NaN is it is preserved e.g. when I turn an array into a list, when I write the data directly into a double* C array, etc. It's handy to manipulate arrays with missing data without being constrained to a specific container. The one thing that is annoying with NaN is that NaN != NaN. It makes code involving comparisons quite ugly. -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Jun 25 11:16:50 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 16:16:50 +0100 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <20110625120044.GA26847@phare.normalesup.org> Message-ID: Hi, On Sat, Jun 25, 2011 at 3:27 PM, Charles R Harris wrote: > > > On Sat, Jun 25, 2011 at 6:00 AM, Gael Varoquaux > wrote: >> >> On Sat, Jun 25, 2011 at 01:02:07AM +0100, Matthew Brett wrote: >> > I'm personally worried that the memory overhead of array.masks will >> > make many of us tend to avoid them. ?I work with images that can >> > easily get large enough that I would not want an array-items size byte >> > array added to my storage. >> >> I work with the same kind of data ( :D ). >> >> The way we manipulate our data, in my lab, is, to represent 3D data with >> a mask to use a 1D array and a corresponding 3D mask. It reduces the >> memory footprint, to the cost of loosing the 3D shape of the data. >> >> I am raising this because it is an option that has not been suggested so >> far. I am not saying that it should be the option used to implement mask >> arrays, I am just saying that there are many different ways of doing it, >> and I don't think that there is a one-size-fits-all solution. >> >> I tend to feel like Wes: good building block to easily implement our own >> solutions is what we want. >> > > Could you expand a bit on what sort of data you have and how you deal with > it. 
Where does it come from, how is it stored on disk, what do you do with > it? That sort of thing. Gael and I groan with the same groan on this one. Our input data are typically 4D and 3D medical images from MRI scanners. The 4D images are usually whole brain scans (3D) collected every few seconds and then concatenated to form the fourth dimension. The data formats are typically very simple binary data in C float etc format, with some associated metadata like the image shape, data type, and sometimes, scaling factors to be applied to the integers on disk, in order to get the desired output values. The formats generally allow the images to be compressed (and still be valid without decompression on the filesystem). One major package (in matlab) insists that these images are not compressed so that the package can memory map the arrays on disk, and, when reading the data via a custom image object, apply the scalefactors to the integer image values on the fly. As Gael says, we often find that only - say - 50% of the image is of interest - inside the brain. In that case it's common to signal pixels as not being of interest, using NaN values. This has the advantage that we can save the masked images to the simple image formats by using a datatype that supports NaN. Gael is pointing out a new masked array concept that uses less memory then the original. It's easy to imagine something like his suggestion being implemented with Travis' deferred array proposal - I remember getting quite excited about that when I saw it. In practice, trying to deal with these images often involves optimization for memory - in particular - because they are often too large to conveniently fit in memory, or to keep more than a few in memory. Cheers, Matthew From njs at pobox.com Sat Jun 25 12:05:54 2011 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 25 Jun 2011 09:05:54 -0700 Subject: [Numpy-discussion] Concepts for masked/missing data Message-ID: So obviously there's a lot of interest in this question, but I'm losing track of all the different issues that've being raised in the 150-post thread of doom. I think I'll find this easier if we start by putting aside the questions about implementation and such and focus for now on the *conceptual model* that we want. Maybe I'm not the only one? So as far as I can tell, there are three different ways of thinking about masked/missing data that people have been using in the other thread: 1) Missingness is part of the data. Some data is missing, some isn't, this might change through computation on the data (just like some data might change from a 3 to a 6 when we apply some transformation, NA | True could be True, instead of NA), but we can't just "decide" that some data is no longer missing. It makes no sense to ask what value is "really" there underneath the missingness. And It's critical that we keep track of this through all operations, because otherwise we may silently give incorrect answers -- exactly like it's critical that we keep track of the difference between 3 and 6. 2) All the data exists, at least in some sense, but we don't always want to look at all of it. We lay a mask over our data to view and manipulate only parts of it at a time. We might want to use different masks at different times, mutate the mask as we go, etc. The most important thing is to provide convenient ways to do complex manipulations -- preserve masks through indexing operations, overlay the mask from one array on top of another array, etc. 
When it comes to other sorts of operations then we'd rather just silently skip the masked values -- we know there are values that are masked, that's the whole point, to work with the unmasked subset of the data, so if sum returned NA then that would just be a stupid hassle. 3) The "all things to all people" approach: implement every feature implied by either (1) or (2), and switch back and forth between these conceptual frameworks whenever necessary to make sense of the resulting code. The advantage of deciding up front what our model is is that it makes a lot of other questions easier. E.g., someone asked in the other thread whether, after setting an array element to NA, it would be possible to get back the original value. If we follow (1), the answer is obviously "no", if we follow (2), the answer is obviously "yes", and if we follow (3), the answer is obviously "yes, probably, well, maybe you better check the docs?". My personal opinions on these are: (1): This is a real problem I face, and there isn't any good solution now. Support for this in numpy would be awesome. (2): This feels more like a convenience feature to me; we already have lots of ways to work with subsets of data. I probably wouldn't bother using it, but that's fine -- I don't use np.matrix either, but some people like it. (3): Well, it's a bit of a mess, but I guess it might be better than nothing? But that's just my opinion. I'm wondering if we can get any consensus on which of these we actually *want* (or maybe we want some fourth option!), and *then* we can try to figure out the best way to get there? Pretty much any implementation strategy we've talked about could work for any of these, but hard to decide between them if we don't even know what we're trying to do... -- Nathaniel From charlesr.harris at gmail.com Sat Jun 25 12:17:22 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 10:17:22 -0600 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 10:05 AM, Nathaniel Smith wrote: > So obviously there's a lot of interest in this question, but I'm > losing track of all the different issues that've being raised in the > 150-post thread of doom. I think I'll find this easier if we start by > putting aside the questions about implementation and such and focus > for now on the *conceptual model* that we want. Maybe I'm not the only > one? > > So as far as I can tell, there are three different ways of thinking > about masked/missing data that people have been using in the other > thread: > > 1) Missingness is part of the data. Some data is missing, some isn't, > this might change through computation on the data (just like some data > might change from a 3 to a 6 when we apply some transformation, NA | > True could be True, instead of NA), but we can't just "decide" that > some data is no longer missing. It makes no sense to ask what value is > "really" there underneath the missingness. And It's critical that we > keep track of this through all operations, because otherwise we may > silently give incorrect answers -- exactly like it's critical that we > keep track of the difference between 3 and 6. > > 2) All the data exists, at least in some sense, but we don't always > want to look at all of it. We lay a mask over our data to view and > manipulate only parts of it at a time. We might want to use different > masks at different times, mutate the mask as we go, etc. 
The most > important thing is to provide convenient ways to do complex > manipulations -- preserve masks through indexing operations, overlay > the mask from one array on top of another array, etc. When it comes to > other sorts of operations then we'd rather just silently skip the > masked values -- we know there are values that are masked, that's the > whole point, to work with the unmasked subset of the data, so if sum > returned NA then that would just be a stupid hassle. > > 3) The "all things to all people" approach: implement every feature > implied by either (1) or (2), and switch back and forth between these > conceptual frameworks whenever necessary to make sense of the > resulting code. > > The advantage of deciding up front what our model is is that it makes > a lot of other questions easier. E.g., someone asked in the other > thread whether, after setting an array element to NA, it would be > possible to get back the original value. If we follow (1), the answer > is obviously "no", if we follow (2), the answer is obviously "yes", > and if we follow (3), the answer is obviously "yes, probably, well, > maybe you better check the docs?". > > My personal opinions on these are: > (1): This is a real problem I face, and there isn't any good solution > now. Support for this in numpy would be awesome. > (2): This feels more like a convenience feature to me; we already have > lots of ways to work with subsets of data. I probably wouldn't bother > using it, but that's fine -- I don't use np.matrix either, but some > people like it. > (3): Well, it's a bit of a mess, but I guess it might be better than > nothing? > > But that's just my opinion. I'm wondering if we can get any consensus > on which of these we actually *want* (or maybe we want some fourth > option!), and *then* we can try to figure out the best way to get > there? Pretty much any implementation strategy we've talked about > could work for any of these, but hard to decide between them if we > don't even know what we're trying to do... > > I go for 3 ;) And I think that is where we are heading. By default, masked array operations look like 1), but by taking views one can get 2). I think the crucial aspect here is the use of views, which both saves on storage and fits with the current numpy concept of views. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Sat Jun 25 12:26:25 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 17:26:25 +0100 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: Hi, On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith wrote: > So obviously there's a lot of interest in this question, but I'm > losing track of all the different issues that've being raised in the > 150-post thread of doom. I think I'll find this easier if we start by > putting aside the questions about implementation and such and focus > for now on the *conceptual model* that we want. Maybe I'm not the only > one? > > So as far as I can tell, there are three different ways of thinking > about masked/missing data that people have been using in the other > thread: > > 1) Missingness is part of the data. Some data is missing, some isn't, > this might change through computation on the data (just like some data > might change from a 3 to a 6 when we apply some transformation, NA | > True could be True, instead of NA), but we can't just "decide" that > some data is no longer missing. 
It makes no sense to ask what value is > "really" there underneath the missingness. And It's critical that we > keep track of this through all operations, because otherwise we may > silently give incorrect answers -- exactly like it's critical that we > keep track of the difference between 3 and 6. So far I see the difference between 1) and 2) being that you cannot unmask. So, if you didn't even know you could unmask data, then it would not matter that 1) was being implemented by masks? > 2) All the data exists, at least in some sense, but we don't always > want to look at all of it. We lay a mask over our data to view and > manipulate only parts of it at a time. We might want to use different > masks at different times, mutate the mask as we go, etc. The most > important thing is to provide convenient ways to do complex > manipulations -- preserve masks through indexing operations, overlay > the mask from one array on top of another array, etc. When it comes to > other sorts of operations then we'd rather just silently skip the > masked values -- we know there are values that are masked, that's the > whole point, to work with the unmasked subset of the data, so if sum > returned NA then that would just be a stupid hassle. To clarify, you're proposing for: a = np.sum(np.array([np.NA, np.NA]) 1) -> np.NA 2) -> 0.0 ? > But that's just my opinion. I'm wondering if we can get any consensus > on which of these we actually *want* (or maybe we want some fourth > option!), and *then* we can try to figure out the best way to get > there? Pretty much any implementation strategy we've talked about > could work for any of these, but hard to decide between them if we > don't even know what we're trying to do... I agree it's good to separate the API from the implementation. I think the implementation is also important because I care about memory and possibly speed. But, that is a separate problem from the API... Cheers, Matthew From charlesr.harris at gmail.com Sat Jun 25 12:48:05 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jun 2011 10:48:05 -0600 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 10:26 AM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith wrote: > > So obviously there's a lot of interest in this question, but I'm > > losing track of all the different issues that've being raised in the > > 150-post thread of doom. I think I'll find this easier if we start by > > putting aside the questions about implementation and such and focus > > for now on the *conceptual model* that we want. Maybe I'm not the only > > one? > > > > So as far as I can tell, there are three different ways of thinking > > about masked/missing data that people have been using in the other > > thread: > > > > 1) Missingness is part of the data. Some data is missing, some isn't, > > this might change through computation on the data (just like some data > > might change from a 3 to a 6 when we apply some transformation, NA | > > True could be True, instead of NA), but we can't just "decide" that > > some data is no longer missing. It makes no sense to ask what value is > > "really" there underneath the missingness. And It's critical that we > > keep track of this through all operations, because otherwise we may > > silently give incorrect answers -- exactly like it's critical that we > > keep track of the difference between 3 and 6. 
> > So far I see the difference between 1) and 2) being that you cannot > unmask. So, if you didn't even know you could unmask data, then it > would not matter that 1) was being implemented by masks? > > > 2) All the data exists, at least in some sense, but we don't always > > want to look at all of it. We lay a mask over our data to view and > > manipulate only parts of it at a time. We might want to use different > > masks at different times, mutate the mask as we go, etc. The most > > important thing is to provide convenient ways to do complex > > manipulations -- preserve masks through indexing operations, overlay > > the mask from one array on top of another array, etc. When it comes to > > other sorts of operations then we'd rather just silently skip the > > masked values -- we know there are values that are masked, that's the > > whole point, to work with the unmasked subset of the data, so if sum > > returned NA then that would just be a stupid hassle. > > To clarify, you're proposing for: > > a = np.sum(np.array([np.NA, np.NA]) > > 1) -> np.NA > 2) -> 0.0 > > ? > > > But that's just my opinion. I'm wondering if we can get any consensus > > on which of these we actually *want* (or maybe we want some fourth > > option!), and *then* we can try to figure out the best way to get > > there? Pretty much any implementation strategy we've talked about > > could work for any of these, but hard to decide between them if we > > don't even know what we're trying to do... > > I agree it's good to separate the API from the implementation. I > think the implementation is also important because I care about memory > and possibly speed. But, that is a separate problem from the API... > > In a larger sense, we are seeking to add metadata to array elements and have ufuncs that use that metadata together with the element values to compute results. Off topic a bit, but it reminds me of the Burroughs 6600 that I once used. The word size on that machine was 48 bits, so it could accommodate both 6 and 8 bit characters, and 3 bits of metadata were appended to mark the type. So there was a machine with 51 bit words ;) IIRC, Knuth was involved in the design and helped with the OS, which was written in ALGOL... Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Jun 25 13:05:38 2011 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 25 Jun 2011 10:05:38 -0700 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett wrote: > So far I see the difference between 1) and 2) being that you cannot > unmask. ?So, if you didn't even know you could unmask data, then it > would not matter that 1) was being implemented by masks? I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed. It seems like some of us have been talking past each other a lot, where someone says "but changing masks is the single most important feature!" and then someone else says "what are you talking about that doesn't even make sense". > To clarify, you're proposing for: > > a = np.sum(np.array([np.NA, np.NA]) > > 1) -> np.NA > 2) -> 0.0 Yes -- and in R you get actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's because they're designed as coherent systems that are trying to solve different problems. 
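Concretely, with today's tools (the R behaviour is noted in a comment rather than shown, since the examples here are Python):

    import numpy as np

    a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
    print(a.sum())                          # 4.0 -- the masked value is skipped
    print(np.nansum([1.0, np.nan, 3.0]))    # 4.0 -- NaN-based version of "skip"
    # In R, sum(c(1, NA, 3)) propagates instead and returns NA.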
(Well, numpy.ma's "hardmask" idea seems inspired by the missing-data concept rather than the temporary-mask concept, but aside from that it seems pretty consistent in implementing option 2.) Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.) In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place. They're both internally consistent, but I think we might have to make a decision and stick to it. > I agree it's good to separate the API from the implementation. ? I > think the implementation is also important because I care about memory > and possibly speed. ?But, that is a separate problem from the API... Yes, absolutely memory and speed are important. But a really fast solution to the wrong problem isn't so useful either :-). -- Nathaniel From matthew.brett at gmail.com Sat Jun 25 13:16:56 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 25 Jun 2011 18:16:56 +0100 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: Hi, On Sat, Jun 25, 2011 at 6:05 PM, Nathaniel Smith wrote: > Yes, absolutely memory and speed are important. But a really fast > solution to the wrong problem isn't so useful either :-). Would you be happy with me summarizing your idea as 1) = NA logic / API 2) = mask logic / API ? It might be possible to create (most of) an NA logic API from a mask implementation, and in fact, I think that's more or less what is being proposed. It's possible to imagine exposing an apparently separate API for mask operations, but which in fact uses the same implementation. Mark, Chuck - is that right? If we choose a mask implementation, that will have consequences for memory (predictable) and performance (unpredictable). Is that a good summary? See you, Matthew From wesmckinn at gmail.com Sat Jun 25 13:17:48 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 25 Jun 2011 13:17:48 -0400 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith wrote: > On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett wrote: >> So far I see the difference between 1) and 2) being that you cannot >> unmask. ?So, if you didn't even know you could unmask data, then it >> would not matter that 1) was being implemented by masks? > > I guess that is a difference, but I'm trying to get at something more > fundamental -- not just what operations are allowed, but what > operations people *expect* to be allowed. It seems like some of us > have been talking past each other a lot, where someone says "but > changing masks is the single most important feature!" and then someone > else says "what are you talking about that doesn't even make sense". 
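To illustrate the "created, modified, and destroyed in place" interface described above, as it already exists in numpy.ma (which sits squarely in camp (2)):

    import numpy as np

    b = np.ma.array([1.0, 2.0, 3.0])   # starts out with no mask
    b[1] = np.ma.masked                # create/extend the mask in place
    print(b.sum())                     # 4.0
    b.mask = False                     # clear the mask again
    print(b[1])                        # 2.0 -- the original value was still there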
> >> To clarify, you're proposing for: >> >> a = np.sum(np.array([np.NA, np.NA]) >> >> 1) -> np.NA >> 2) -> 0.0 > > Yes -- and in R you get actually do get NA, while in numpy.ma you > actually do get 0. I don't think this is a coincidence; I think it's > because they're designed as coherent systems that are trying to solve > different problems. (Well, numpy.ma's "hardmask" idea seems inspired > by the missing-data concept rather than the temporary-mask concept, > but aside from that it seems pretty consistent in implementing option > 2.) Agree. My basic observation about numpy.ma is that it's a finely crafted solution for a different set of problems than the ones I have. I just don't want the same thing to happen here so I'm stuck writing code (like I am now) that looks like mask = y.mask the_sum = y.sum(axis) the_count = mask.sum(axis) the_sum[the_count == 0] = nan > Here's another possible difference -- in (1), intuitively, missingness > is a property of the data, so the logical place to put information > about whether you can expect missing values is in the dtype, and to > enable missing values you need to make a new array with a new dtype. > (If we use a mask-based implementation, then > np.asarray(nomissing_array, dtype=yesmissing_type) would still be able > to skip making a copy of the data -- I'm talking ONLY about the > interface here, not whether missing data has a different storage > format from non-missing data.) > > In (2), the whole point is to use different masks with the same data, > so I'd argue masking should be a property of the array object rather > than the dtype, and the interface should logically allow masks to be > created, modified, and destroyed in place. > > They're both internally consistent, but I think we might have to make > a decision and stick to it. > >> I agree it's good to separate the API from the implementation. ? I >> think the implementation is also important because I care about memory >> and possibly speed. ?But, that is a separate problem from the API... > > Yes, absolutely memory and speed are important. But a really fast > solution to the wrong problem isn't so useful either :-). > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From ben.root at ou.edu Sat Jun 25 14:06:14 2011 From: ben.root at ou.edu (Benjamin Root) Date: Sat, 25 Jun 2011 13:06:14 -0500 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith wrote: > > So obviously there's a lot of interest in this question, but I'm > > losing track of all the different issues that've being raised in the > > 150-post thread of doom. I think I'll find this easier if we start by > > putting aside the questions about implementation and such and focus > > for now on the *conceptual model* that we want. Maybe I'm not the only > > one? > > > > So as far as I can tell, there are three different ways of thinking > > about masked/missing data that people have been using in the other > > thread: > > > > 1) Missingness is part of the data. 
Some data is missing, some isn't, > > this might change through computation on the data (just like some data > > might change from a 3 to a 6 when we apply some transformation, NA | > > True could be True, instead of NA), but we can't just "decide" that > > some data is no longer missing. It makes no sense to ask what value is > > "really" there underneath the missingness. And It's critical that we > > keep track of this through all operations, because otherwise we may > > silently give incorrect answers -- exactly like it's critical that we > > keep track of the difference between 3 and 6. > > So far I see the difference between 1) and 2) being that you cannot > unmask. So, if you didn't even know you could unmask data, then it > would not matter that 1) was being implemented by masks? > > Yes, bingo, you hit it right on the nose. Essentially, 1) could be considered the "hard mask", while 2) would be the "soft mask". Everything else is implementation details. > > 2) All the data exists, at least in some sense, but we don't always > > want to look at all of it. We lay a mask over our data to view and > > manipulate only parts of it at a time. We might want to use different > > masks at different times, mutate the mask as we go, etc. The most > > important thing is to provide convenient ways to do complex > > manipulations -- preserve masks through indexing operations, overlay > > the mask from one array on top of another array, etc. When it comes to > > other sorts of operations then we'd rather just silently skip the > > masked values -- we know there are values that are masked, that's the > > whole point, to work with the unmasked subset of the data, so if sum > > returned NA then that would just be a stupid hassle. > > To clarify, you're proposing for: > > a = np.sum(np.array([np.NA, np.NA]) > > 1) -> np.NA > 2) -> 0.0 > > ? > Actually, I have always considered this to be a bug. Note that "np.sum([])" also returns 0.0. I think the reason why it has been returning zero instead of NaN was because there wasn't a NaN-equivalent for integers. This is where I think a np.NA could best serve NumPy by providing a dtype-agnostic way to represent missing or invalid data. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From alan.isaac at gmail.com Sat Jun 25 14:18:28 2011 From: alan.isaac at gmail.com (Alan G Isaac) Date: Sat, 25 Jun 2011 14:18:28 -0400 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: <4E062674.4010406@gmail.com> On 6/25/2011 2:06 PM, Benjamin Root wrote: > Note that "np.sum([])" also returns 0.0. I think the > reason why it has been returning zero instead of NaN was > because there wasn't a NaN-equivalent for integers. http://en.wikipedia.org/wiki/Empty_sum fwiw, Alan Isaac From ben.root at ou.edu Sat Jun 25 14:32:54 2011 From: ben.root at ou.edu (Benjamin Root) Date: Sat, 25 Jun 2011 13:32:54 -0500 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith wrote: > On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett > wrote: > > So far I see the difference between 1) and 2) being that you cannot > > unmask. So, if you didn't even know you could unmask data, then it > > would not matter that 1) was being implemented by masks? > > I guess that is a difference, but I'm trying to get at something more > fundamental -- not just what operations are allowed, but what > operations people *expect* to be allowed. 
That is quite a trickier problem. > > Here's another possible difference -- in (1), intuitively, missingness > is a property of the data, so the logical place to put information > about whether you can expect missing values is in the dtype, and to > enable missing values you need to make a new array with a new dtype. > (If we use a mask-based implementation, then > np.asarray(nomissing_array, dtype=yesmissing_type) would still be able > to skip making a copy of the data -- I'm talking ONLY about the > interface here, not whether missing data has a different storage > format from non-missing data.) > > In (2), the whole point is to use different masks with the same data, > so I'd argue masking should be a property of the array object rather > than the dtype, and the interface should logically allow masks to be > created, modified, and destroyed in place. > > I can agree with this distinction. However, if "missingness" is an intrinsic property of the data, then shouldn't users be implementing their own dtype tailored to the data they are using? In other words, how far does the core of NumPy need to go to address this issue? And how far would be "too much"? > They're both internally consistent, but I think we might have to make > a decision and stick to it. > > Of course. I think that Mark is having a very inspired idea of giving the R audience what they want (np.NA), while simultaneously making the use of masked arrays even easier (which I can certainly appreciate). > > I agree it's good to separate the API from the implementation. I > > think the implementation is also important because I care about memory > > and possibly speed. But, that is a separate problem from the API... > > Yes, absolutely memory and speed are important. But a really fast > solution to the wrong problem isn't so useful either :-). > > The one thing I have always loved about Python (and NumPy) is that "it respects the developer's time". I come from a C++ background where I found C++ to be powerful, but tedious. I went to Matlab because it was just straight-up easier to code math and display graphs. (If anybody here ever used GrADS, then you know how badly I would want a language that respected my time). However, even Matlab couldn't fully respect my time as I usually kept wasting it trying to get various pieces working. Python came along, and while it didn't always match the speed of some of my matlab programs, it was "fast enough". I will put out a little disclaimer. I once had to use S+ for a class. To be honest, it was the worst programming experience in my life. This experience may be coloring my perception of R's approach to handling missing data. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From efiring at hawaii.edu Sat Jun 25 14:50:32 2011 From: efiring at hawaii.edu (Eric Firing) Date: Sat, 25 Jun 2011 08:50:32 -1000 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: <4E062DF8.5010409@hawaii.edu> On 06/25/2011 07:05 AM, Nathaniel Smith wrote: > On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett wrote: >> So far I see the difference between 1) and 2) being that you cannot >> unmask. So, if you didn't even know you could unmask data, then it >> would not matter that 1) was being implemented by masks? > > I guess that is a difference, but I'm trying to get at something more > fundamental -- not just what operations are allowed, but what > operations people *expect* to be allowed. 
It seems like some of us > have been talking past each other a lot, where someone says "but > changing masks is the single most important feature!" and then someone > else says "what are you talking about that doesn't even make sense". > >> To clarify, you're proposing for: >> >> a = np.sum(np.array([np.NA, np.NA]) >> >> 1) -> np.NA >> 2) -> 0.0 > > Yes -- and in R you get actually do get NA, while in numpy.ma you > actually do get 0. I don't think this is a coincidence; I think it's No, you don't: In [2]: np.ma.array([2, 4], mask=[True, True]).sum() Out[2]: masked In [4]: np.sum(np.ma.array([2, 4], mask=[True, True])) Out[4]: masked Eric > because they're designed as coherent systems that are trying to solve > different problems. (Well, numpy.ma's "hardmask" idea seems inspired > by the missing-data concept rather than the temporary-mask concept, > but aside from that it seems pretty consistent in implementing option > 2.) > > Here's another possible difference -- in (1), intuitively, missingness > is a property of the data, so the logical place to put information > about whether you can expect missing values is in the dtype, and to > enable missing values you need to make a new array with a new dtype. > (If we use a mask-based implementation, then > np.asarray(nomissing_array, dtype=yesmissing_type) would still be able > to skip making a copy of the data -- I'm talking ONLY about the > interface here, not whether missing data has a different storage > format from non-missing data.) > > In (2), the whole point is to use different masks with the same data, > so I'd argue masking should be a property of the array object rather > than the dtype, and the interface should logically allow masks to be > created, modified, and destroyed in place. > > They're both internally consistent, but I think we might have to make > a decision and stick to it. > >> I agree it's good to separate the API from the implementation. I >> think the implementation is also important because I care about memory >> and possibly speed. But, that is a separate problem from the API... > > Yes, absolutely memory and speed are important. But a really fast > solution to the wrong problem isn't so useful either :-). > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From njs at pobox.com Sat Jun 25 14:57:59 2011 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 25 Jun 2011 11:57:59 -0700 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: <4E062DF8.5010409@hawaii.edu> References: <4E062DF8.5010409@hawaii.edu> Message-ID: On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing wrote: > On 06/25/2011 07:05 AM, Nathaniel Smith wrote: >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett ?wrote: >>> To clarify, you're proposing for: >>> >>> a = np.sum(np.array([np.NA, np.NA]) >>> >>> 1) -> ?np.NA >>> 2) -> ?0.0 >> >> Yes -- and in R you get actually do get NA, while in numpy.ma you >> actually do get 0. I don't think this is a coincidence; I think it's > > No, you don't: > > In [2]: np.ma.array([2, 4], mask=[True, True]).sum() > Out[2]: masked > > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True])) > Out[4]: masked Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but sum([NA]) and sum([]) are different? Sounds to me like you should file a bug on numpy.ma... 
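For concreteness, here is a minimal check of exactly those cases against numpy.ma as it exists today (the "NA" in the comments is only shorthand for a masked element -- np.NA itself is still hypothetical):

>>> import numpy as np
>>> np.ma.array([10, 2], mask=[False, True]).sum()  # "sum([10, NA])": masked value ignored
10
>>> np.ma.array([2], mask=[True]).sum()             # "sum([NA])": everything masked
masked
>>> np.sum([])                                      # "sum([])": the empty-sum convention
0.0

A propagating, R-style NA would instead give NA for both of the masked sums.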
Anyway, the general point is that in R, NA's propagate, and in
numpy.ma, masked values are ignored (except, apparently, if all values
are masked). Here, I actually checked these:

Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
R: sum(c(NA, 4)) -> NA

-- Nathaniel

From ben.root at ou.edu Sat Jun 25 14:58:56 2011
From: ben.root at ou.edu (Benjamin Root)
Date: Sat, 25 Jun 2011 13:58:56 -0500
Subject: [Numpy-discussion] Concepts for masked/missing data
In-Reply-To: 
References: 
Message-ID: 

On Sat, Jun 25, 2011 at 12:17 PM, Wes McKinney wrote:
>
> Agree. My basic observation about numpy.ma is that it's a finely
> crafted solution for a different set of problems than the ones I have.
> I just don't want the same thing to happen here so I'm stuck writing
> code (like I am now) that looks like
>
> mask = y.mask
> the_sum = y.sum(axis)
> the_count = mask.sum(axis)
> the_sum[the_count == 0] = nan
>

Yeah, you are expecting NaN behavior there, in which case, using NaNs
without masks is fine. But, for a general solution, you might want to
consider that masked_array provides a "recordmask" as well as a mask.
Although, the documentation is very lacking, and using masked arrays
with record arrays seems very clumsy, but that is certainly something
that could get cleaned up.

Also, as a cleaner version of your code:

the_sum = y.sum(axis)
the_sum.mask = np.any(y.mask, axis)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ben.root at ou.edu Sat Jun 25 15:09:38 2011
From: ben.root at ou.edu (Benjamin Root)
Date: Sat, 25 Jun 2011 14:09:38 -0500
Subject: [Numpy-discussion] Concepts for masked/missing data
In-Reply-To: 
References: <4E062DF8.5010409@hawaii.edu>
Message-ID: 

On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith wrote:
> On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing wrote:
> > On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
> >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett
> wrote:
> >>> To clarify, you're proposing for:
> >>>
> >>> a = np.sum(np.array([np.NA, np.NA])
> >>>
> >>> 1) -> np.NA
> >>> 2) -> 0.0
> >>
> >> Yes -- and in R you get actually do get NA, while in numpy.ma you
> >> actually do get 0. I don't think this is a coincidence; I think it's
> >
> > No, you don't:
> >
> > In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
> > Out[2]: masked
> >
> > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
> > Out[4]: masked
>
> Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
> sum([NA]) and sum([]) are different? Sounds to me like you should file
> a bug on numpy.ma...
>

Actually, no... I should have tested this before replying earlier:

>>> a = np.ma.array([2, 4], mask=[True, True])
>>> a
masked_array(data = [-- --],
             mask = [ True True],
       fill_value = 999999)
>>> a.sum()
masked
>>> a = np.ma.array([], mask=[])
>>> a
masked_array(data = [],
             mask = [],
       fill_value = 1e+20)
>>> a.sum()
masked

They are the same.

> Anyway, the general point is that in R, NA's propagate, and in
> numpy.ma, masked values are ignored (except, apparently, if all values
> are masked). Here, I actually checked these:
>
> Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
> R: sum(c(NA, 4)) -> NA
>

If you want NaN behavior, then use NaNs. If you want masked behavior,
then use masks.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ben.root at ou.edu Sat Jun 25 15:15:31 2011 From: ben.root at ou.edu (Benjamin Root) Date: Sat, 25 Jun 2011 14:15:31 -0500 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: <4E062674.4010406@gmail.com> References: <4E062674.4010406@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 1:18 PM, Alan G Isaac wrote: > On 6/25/2011 2:06 PM, Benjamin Root wrote: > > Note that "np.sum([])" also returns 0.0. I think the > > reason why it has been returning zero instead of NaN was > > because there wasn't a NaN-equivalent for integers. > > > http://en.wikipedia.org/wiki/Empty_sum > > Ah, thanks. This then does raise the question of what masked arrays should do in the case of a completely masked out record (or field, or whatever). Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 15:24:08 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 14:24:08 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 2:09 PM, Benjamin Root wrote: > > Another example of how we use masks in matplotlib is in pcolor(). We > have > > to combine the possible masks of X, Y, and V in both the x and y > directions > > to find the final mask to use for the final output result (because each > > facet needs valid data at each corner). Having a soft-mask > implementation > > allows one to create a temporary mask to use for the operation, and to > share > > that mask across all the input data, but then let the data structures > retain > > their original masks when done. > > This is a situation where I would just... use an array and a mask, > rather than a masked array. Then lots of things -- changing fill > values, temporarily masking/unmasking things, etc. -- come from free, > just from knowing how arrays and boolean indexing work? > For free, in sort of the same way that a C pointer, a dtype, a shape, and an array of strides gives you the NumPy ndarray for free. That example is a bit more extreme, but the same idea applies. > Do we really get much advantage by building all these complex > operations in? I worry that we're trying to anticipate and write code > for every situation that users find themselves in, instead of just > giving them some simple, orthogonal tools. > Simple orthogonal tools is what I'm aiming for. That's driven by the myriad use cases, so it's important to understand all of them and to allow them to hone the design. As a corollary, I worry that learning and keeping track of how masked > arrays work is more hassle than just ignoring them and writing the > necessary code by hand as needed. Certainly I can imagine that *if the > mask is a property of the data* then it's useful to have tools to keep > it aligned with the data through indexing and such. But some of these > other things are quicker to reimplement than to look up the docs for, > and the reimplementation is easier to read, at least for me... > This is where designing the interface NumPy exposes comes in. Whether NA values are represented with a separate mask or with a special bit pattern, the missing value API should be basically the same, because it's for the users of the system, the representation is an implementation detail. 
It's still important, of course, but a separate issue which is being treated as the same in many comments in this thread. -Mark (By the way, is this hard-mask/soft-mask stuff documented anywhere? I > spent some time groveling over the numpy docs with google's help, and > I still have no idea what it actually is.) > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 15:35:40 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 14:35:40 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 8:25 PM, Benjamin Root wrote: > On Fri, Jun 24, 2011 at 8:00 PM, Mark Wiebe wrote: > >> On Fri, Jun 24, 2011 at 6:22 PM, Wes McKinney wrote: >> >>> On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris >>> wrote: >>> > >>> > >>> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett < >>> matthew.brett at gmail.com> >>> > wrote: >>> >> >>> >> Hi, >>> >> >>> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root >>> wrote: >>> >> ... >>> >> > Again, there are pros and cons either way and I see them very >>> orthogonal >>> >> > and >>> >> > complementary. >>> >> >>> >> That may be true, but I imagine only one of them will be implemented. >>> >> >>> >> @Mark - I don't have a clear idea whether you consider the nafloat64 >>> >> option to be still in play as the first thing to be implemented >>> >> (before array.mask). If it is, what kind of thing would persuade you >>> >> either way? >>> >> >>> > >>> > Mark can speak for himself, but I think things are tending towards >>> masks. >>> > They have the advantage of one implementation for all data types, >>> current >>> > and future, and they are more flexible since the masked data can be >>> actual >>> > valid data that you just choose to ignore for experimental reasons. >>> > >>> > What might be helpful is a routine to import/export R files, but that >>> > shouldn't be to difficult to implement. >>> > >>> > Chuck >>> > >>> > >>> > _______________________________________________ >>> > NumPy-Discussion mailing list >>> > NumPy-Discussion at scipy.org >>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> > >>> > >>> >>> Perhaps we should make a wiki page someplace summarizing pros and cons >>> of the various implementation approaches? I worry very seriously about >>> adding API functions relating to masks rather than having special NA >>> values which propagate in algorithms. The question is: will Joe Blow >>> Former R user have to understand what is the mask and how to work with >>> it? If the answer is yes we have a problem. If it can be completely >>> hidden as an implementation detail, that's great. In R NAs are just >>> sort of inherent-- they propagate you deal with them when you have to >>> via na.rm flag in functions or is.na. >>> >> >> I think the interface for how it looks in NumPy can be made to be pretty >> close to the same with either design approach. I've updated the NEP to add >> and emphasize using masked values with an np.NA singleton, with the >> validitymask as the implementation mechanism which is still accessible for >> those who want to still deal with the mask directly. 
>> > > I think there are a lot of benefits to this idea, if I understand it > correctly. Essentially, if I were to assign np.NA to an element (or a > slice) of a numpy array, rather than actually assigning that value to that > spot in the array, it would set the mask to True for those elements? > That's correct. I see a lot of benefits to this idea. Imagine in pylab mode (from pylab > import *), users would have the NA name right in their namespace, just like > they are used to with R. And those who want could still mess around with > masks much like we are used to. Plus, I think we can still retain the good > ol' C-pointer to regular data. My question is this. Will it be a soft or > hard mask? In other words, if I were to assign np.NA to a spot in an array, > would it kill the value that was there (if it already was initialized)? > Would I still be able to share masks? > Everything so far being discussed has been soft masks, from my understanding of the soft/hard mask distinction. Whether to support hard masks is still an open question, with my only use case as being a more reasonable return value for boolean indexing than is currently used. I don't believe sharing masks will be possible, because it would directly violate the missing value abstraction. Can you describe an example where you are sharing masks? Admittedly, it is a munging of two distinct ideas, but, I think the end > result would still be the same. > It seems to me that the only distinction between using a mask versus a an NA bit pattern is that the NA bit pattern causes the underlying value to be destroyed when assigning np.NA. So I'm thinking of the mask as supporting everything the NA bit pattern does, and a bit more. >> >>> The other problem I can think of with masks is the extra memory >>> footprint, though maybe this is no cause for concern. >>> >> >> The overhead is definitely worth considering, along with the extra memory >> traffic it generates, and I've basically concluded that the increased >> generality and flexibility is worth the added cost. >> >> > If we go with the mask approach, one could later try and optimize the > implementation of the masks to reduce the memory footprint. Potentially, > one could reduce the footprint by 7/8ths! Maybe some sneaky striding tricks > could help keep from too much cache misses (or go with the approach Pierre > mentioned which was to calculate them all and let the masks sort them out > afterwards). > This could be very tricky to implement, especially supporting views. I do see this as a strong argument for completely hiding the mask memory from the user of the system, only being able to access it by getting a copy. Without a strong data-hiding approach such as this, the 7/8th memory footprint optimization is basically precluded. > As a complete side-thought, I wonder how sparse arrays could play into this > discussion? > Sparse matrices usually mean 0 values aren't stored, but a sparse storage mechanism for the data being represented by masked arrays when most of the elements are masked is a closely related and important problem. It doesn't have much bearing on the present discussion, since the usage patterns have to be dramatically different to support efficiency. -Mark > > Ben Root > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ben.root at ou.edu Sat Jun 25 15:44:12 2011 From: ben.root at ou.edu (Benjamin Root) Date: Sat, 25 Jun 2011 14:44:12 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 9:21 AM, Charles R Harris wrote: > > I think he aims to support both. One complication with masks is keeping > them tied to the data on disk. With na values one file can contain both the > data and the missing data markers, whereas with masks, two files would be > required. I don't think that will fly in the long run unless there is some > standard file format, like geotiff for GIS, that combines both. > > Chuck > > I think you hit on a very important issue. How do we save this data? The NA approach provides an inherent solution (although this does assume that the magic value is known and agreed to because it is not recorded. Meanwhile, masks are separate structures. I have been so accustomed to simply creating masks on the fly that I never bothered to save them (or I simply harden them with a fill value of NaNs or whatnot). As for a filetype that I use when I do want to save this information, I typically use NetCDF with a "missing" attribute defined or just use NaNs. The scipy.io.netcdf module could be improved to look for a "missing" attribute and set an appropriate NA value automatically, Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 15:44:42 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 14:44:42 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 10:59 PM, Nathaniel Smith wrote: > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root wrote: > > On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: > >> This is a situation where I would just... use an array and a mask, > >> rather than a masked array. Then lots of things -- changing fill > >> values, temporarily masking/unmasking things, etc. -- come from free, > >> just from knowing how arrays and boolean indexing work? > > > > With a masked array, it is "for free". Why re-invent the wheel? It has > > already been done for me. > > But it's not for free at all. It's an additional concept that has to > be maintained, documented, and learned (with the last cost, which is > multiplied by the number of users, being by far the greatest). It's > not reinventing the wheel, it's saying hey, I have wheels and axles, > but what I really need the library to provide is a wheel+axle > assembly! > It feels like you're suggesting the NA bit pattern vs mask distinction and the programming interface users of NumPy see are closely tied together. This isn't the case at all, and I would like more feedback on the interface side of things irrespective of the implementation details. Please tell me what your wheel+axle assembly looks like. >> Do we really get much advantage by building all these complex > >> operations in? I worry that we're trying to anticipate and write code > >> for every situation that users find themselves in, instead of just > >> giving them some simple, orthogonal tools. 
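For reference, my reading of the bare array-plus-boolean-mask idiom being described is roughly the following sketch (the variable names are made up for illustration):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
keep = np.array([True, False, True, True])   # True means "use this value"

subtotal = data[keep].sum()            # reductions run over the unmasked subset
filled = np.where(keep, data, 0.0)     # a "fill value" is just np.where
keep[1] = True                         # "unmasking" is plain boolean assignment

If that is the assembly you mean, the open question is which of the conveniences layered on top of it belong in the core.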
> >> > > > > This is the danger, and which is why I advocate retaining the MaskedArray > > type that would provide the high-level "intelligent" operations, > meanwhile > > having in the core the basic data structures for pairing a mask with an > > array, and to recognize a special np.NA value that would act upon the > mask > > rather than the underlying data. Users would get very basic > functionality, > > while the MaskedArray would continue to provide the interface that we are > > used to. > > The interface as described is quite different... in particular, all > aggregate operations would change their behavior. > Which operations are changing, and what is the difference in behavior? I don't recall proposing something like this. My initial proposal had a difference with R for the aggregate operations, but I've changed the NEP based on your feedback. >> As a corollary, I worry that learning and keeping track of how masked > >> arrays work is more hassle than just ignoring them and writing the > >> necessary code by hand as needed. Certainly I can imagine that *if the > >> mask is a property of the data* then it's useful to have tools to keep > >> it aligned with the data through indexing and such. But some of these > >> other things are quicker to reimplement than to look up the docs for, > >> and the reimplementation is easier to read, at least for me... > > > > What you are advocating is similar to the "tried-n-true" coding practice > of > > Matlab users of using NaNs. You will hear from Matlab programmers about > how > > it is the greatest idea since sliced bread (and I was one of them). Then > I > > was introduced to Numpy, and I while I do sometimes still do the NaN > > approach, I realized that the masked array is a "better" way. > > Hey, no need to go around calling people Matlab programmers, you might > hurt someone's feelings. > > But seriously, my argument is that every abstraction and new concept > has a cost, and I'm dubious that the full masked array abstraction > carries its weight and justifies this cost, because it's highly > redundant with existing abstractions. That has nothing to do with how > tried-and-true anything is. > The abstraction is R-like missing values, and two implementation mechanisms are NA bit patterns and masks. There is no "full masked array abstraction" as a component end users will have to learn. -Mark > As for documentation, on hard/soft masks, just look at the docs for the > > MaskedArray constructor: > [...snipped...] > > Thanks! > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 15:46:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 14:46:44 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Fri, Jun 24, 2011 at 11:06 PM, Wes McKinney wrote: > On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith wrote: > > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root wrote: > >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith wrote: > >>> This is a situation where I would just... use an array and a mask, > >>> rather than a masked array. Then lots of things -- changing fill > >>> values, temporarily masking/unmasking things, etc. 
-- come from free, > >>> just from knowing how arrays and boolean indexing work? > >> > >> With a masked array, it is "for free". Why re-invent the wheel? It has > >> already been done for me. > > > > But it's not for free at all. It's an additional concept that has to > > be maintained, documented, and learned (with the last cost, which is > > multiplied by the number of users, being by far the greatest). It's > > not reinventing the wheel, it's saying hey, I have wheels and axles, > > but what I really need the library to provide is a wheel+axle > > assembly! > > You're communicating my argument better than I am. > > >>> Do we really get much advantage by building all these complex > >>> operations in? I worry that we're trying to anticipate and write code > >>> for every situation that users find themselves in, instead of just > >>> giving them some simple, orthogonal tools. > >>> > >> > >> This is the danger, and which is why I advocate retaining the > MaskedArray > >> type that would provide the high-level "intelligent" operations, > meanwhile > >> having in the core the basic data structures for pairing a mask with an > >> array, and to recognize a special np.NA value that would act upon the > mask > >> rather than the underlying data. Users would get very basic > functionality, > >> while the MaskedArray would continue to provide the interface that we > are > >> used to. > > > > The interface as described is quite different... in particular, all > > aggregate operations would change their behavior. > > > >>> As a corollary, I worry that learning and keeping track of how masked > >>> arrays work is more hassle than just ignoring them and writing the > >>> necessary code by hand as needed. Certainly I can imagine that *if the > >>> mask is a property of the data* then it's useful to have tools to keep > >>> it aligned with the data through indexing and such. But some of these > >>> other things are quicker to reimplement than to look up the docs for, > >>> and the reimplementation is easier to read, at least for me... > >> > >> What you are advocating is similar to the "tried-n-true" coding practice > of > >> Matlab users of using NaNs. You will hear from Matlab programmers about > how > >> it is the greatest idea since sliced bread (and I was one of them). > Then I > >> was introduced to Numpy, and I while I do sometimes still do the NaN > >> approach, I realized that the masked array is a "better" way. > > > > Hey, no need to go around calling people Matlab programmers, you might > > hurt someone's feelings. > > > > But seriously, my argument is that every abstraction and new concept > > has a cost, and I'm dubious that the full masked array abstraction > > carries its weight and justifies this cost, because it's highly > > redundant with existing abstractions. That has nothing to do with how > > tried-and-true anything is. > > +1. I think I will personally only be happy if "masked array" can be > implemented while incurring near-zero cost from the end user > perspective. If what we end up with is a faster implementation of > numpy.ma in C I'm probably going to keep on using NaN... That's why > I'm entirely insistent that whatever design be dogfooded on non-expert > users. If it's very much harder / trickier / nuanced than R, you will > have failed. > This is indeed my aim, and getting feedback on the interface before it's implemented is one of the important goals of this discussion, so that the later dogfooding doesn't raise issues that are too late in the implementation to change. 
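To make that concrete, the kind of non-expert session I'm aiming for is roughly the following -- treat every name in it (np.NA, np.isna, the maskna= and skipna= flags) as provisional spellings from the NEP discussion, not as existing NumPy:

>>> a = np.array([1.0, 2.0, 3.0], maskna=True)  # maskna= is a placeholder for "NA support enabled"
>>> a[1] = np.NA        # sets the mask; the original value is untouched underneath
>>> a.sum()             # NA propagates by default, like R
NA
>>> a.sum(skipna=True)  # explicit opt-in to skip, like R's na.rm=TRUE
4.0
>>> np.isna(a)
array([False,  True, False], dtype=bool)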
-Mark > > >> As for documentation, on hard/soft masks, just look at the docs for the > >> MaskedArray constructor: > > [...snipped...] > > > > Thanks! > > > > -- Nathaniel > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Jun 25 15:51:52 2011 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 25 Jun 2011 12:51:52 -0700 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root wrote: > On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith wrote: >> I guess that is a difference, but I'm trying to get at something more >> fundamental -- not just what operations are allowed, but what >> operations people *expect* to be allowed. > > That is quite a trickier problem. It can be. I think of it as the difference between design and coding. They overlap less than one might expect... >> Here's another possible difference -- in (1), intuitively, missingness >> is a property of the data, so the logical place to put information >> about whether you can expect missing values is in the dtype, and to >> enable missing values you need to make a new array with a new dtype. >> (If we use a mask-based implementation, then >> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able >> to skip making a copy of the data -- I'm talking ONLY about the >> interface here, not whether missing data has a different storage >> format from non-missing data.) >> >> In (2), the whole point is to use different masks with the same data, >> so I'd argue masking should be a property of the array object rather >> than the dtype, and the interface should logically allow masks to be >> created, modified, and destroyed in place. >> > > I can agree with this distinction.? However, if "missingness" is an > intrinsic property of the data, then shouldn't users be implementing their > own dtype tailored to the data they are using?? In other words, how far does > the core of NumPy need to go to address this issue?? And how far would be > "too much"? Yes, that's exactly my question: whether our goal is to implement missingness in numpy or not! >> >> They're both internally consistent, but I think we might have to make >> a decision and stick to it. >> > > Of course.? I think that Mark is having a very inspired idea of giving the R > audience what they want (np.NA), while simultaneously making the use of > masked arrays even easier (which I can certainly appreciate). I don't know. I think we could build a really top-notch implementation of missingness. I also think we could build a really top-notch implementation of masking. But my suggestions for how to improve the current design are totally different depending on which of those is the goal, and neither the R audience (like me) nor the masked array audience (like you) seems really happy with the current design. And I don't know what the goal is -- maybe it's something else and the current design hits it perfectly? 
Maybe we want a top-notch implementation of *both* missingness and masking, and those should be two different things that can be combined, so that some of the unmasked values inside a masked array can be NA? I don't know. > I will put out a little disclaimer.? I once had to use S+ for a class.? To > be honest, it was the worst programming experience in my life.? This > experience may be coloring my perception of R's approach to handling missing > data. There's a lot of things that R does wrong (not their fault; language design is an extremely difficult and specialized skill, that statisticians are not exactly trained in), but it did make a few excellent choices at the beginning. One was to steal the execution model from Scheme, which, uh, isn't really relevant here. The other was to steal the basic data types and standard library that the Bell Labs statisticians had pounded into shape over many years. I use Python now because using R for everything would drive me crazy, but despite its many flaws, it still does some things so well that it's become *the* language used for basically all statistical research. I'm only talking about stealing those things :-). -- Nathaniel From mwwiebe at gmail.com Sat Jun 25 15:56:44 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 14:56:44 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 6:00 AM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 1:54 AM, Mark Wiebe wrote: > > On Fri, Jun 24, 2011 at 5:21 PM, Matthew Brett > ... > >> @Mark - I don't have a clear idea whether you consider the nafloat64 > >> option to be still in play as the first thing to be implemented > >> (before array.mask). If it is, what kind of thing would persuade you > >> either way? > > > > I'm focusing all of my effort on getting my proposal of adding a mask to > the > > core ndarray into a state where it satisfies everyone's requirements as > best > > I can. > > Maybe it would be worth setting out the requirements formally somewhere? > The design that's forming is a combination of: * Solve the missing data problem * My ideas of what a good solution looks like: * applies to all NumPy dtypes in a fully general way * high-performance, low overhead where possible * makes the C-level implementation of NumPy nicer to work with, not harder * easy to use from Python for unskilled programmers * easy to use more powerful functionality from Python for skilled programmers * satisfies all or most of the needs of the many users of arrays with a "missing data" aspect to them * All the feedback I'm getting from discussions on the list That's not a formal requirements specification, but might shed some insight. > I'm not precluding the possibility that someone could convince me > > that the na-dtype is good, but I gave it a good chunk of thought before > > starting to write the proposal. To persuade me towards the na-dtype > option, > > I need to be convinced that I'm solving the problem class in a generic > way > > that works orthogonally with other features, with manageable > implementation > > requirements, a very usable result for both strong and weak programmers, > and > > with good performance characteristics. I think the na-dtype approach > isn't > > as generic as I would like, and the implementation seems like it would be > > trickier than the masked approach. 
> > What I'm getting at, is that I think you have made the decision > between these two implementations some time ago while looking at the C > code. Now of course you would be a much better person to make that > decision than - say - me. It's just that, if you want coherent > feedback from us on this decision, we need to get some technical grasp > of why you made it. I realize that it will not be easy to explain > in detail, but honestly, it could be a useful discussion to have from > your and our point of view, even if it ends up in the same place. > I've updated a section "Parameterized Data Type With NA Signal Values" in the NEP with an idea for now an NA bit pattern approach could coexist and work together with the mask-based approach. I think I've solved some of the generality and implementation obstacles, it would be great to get some feedback on that. Cheers, Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 16:05:01 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:05:01 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 6:17 AM, Matthew Brett wrote: > Hi, > > On Sat, Jun 25, 2011 at 2:10 AM, Mark Wiebe wrote: > > On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney > >> wrote: > >> ... > >> > Perhaps we should make a wiki page someplace summarizing pros and cons > >> > of the various implementation approaches? > >> > >> But - we should do this if it really is an open question which one we > >> go for. If not then, we're just slowing Mark down in getting to the > >> implementation. > >> > >> Assuming the question is still open, here's a starter for the pros and > >> cons: > >> > >> array.mask > >> 1) It's easier / neater to implement > > > > Yes > > > >> > >> 2) It can generalize across dtypes > > > > Yes > > > >> > >> 3) You can still get the masked data underneath the mask (allowing you > >> to unmask etc) > > > > By setting up views appropriately, yes. If you don't have another view to > > the underlying data, you can't get at it. > >> > >> nafloat64: > >> 1) No memory overhead > > > > Yes > > > >> > >> 2) Battle-tested implementation already done in R > > > > We can't really use that though, R is GPL and NumPy is BSD. The > low-level > > implementation details are likely different enough that a > re-implementation > > would be needed anyway. > > Right - I wasn't suggesting using the code, only that the idea can be > made to work coherently with an API that seems to have won friends > over time. > OK, so I think you mean a battle-tested implementation of the interface R exposes. That interface can be implemented with either masks or NA bit patterns, I don't believe it has anything specific to bit patterns inherent in it. > > >> I guess we'd have to test directly whether the non-continuous memory > >> of the mask and data would cause enough cache-miss problems to > >> outweigh the potential cycle-savings from single byte comparisons in > >> array.mask. > > > > The different memory buffers are each contiguous, so the access patterns > > still have a lot of coherency. 
I intend to give the mask memory layouts > > matching those of the arrays. > >> > >> I guess that one and only one of these will get written. I guess that > >> one of these choices may be a lot more satisfying to the current and > >> future masked array itch than the other. > > > > I'm only going to implement one solution, yes. > >> > >> I'm personally worried that the memory overhead of array.masks will > >> make many of us tend to avoid them. I work with images that can > >> easily get large enough that I would not want an array-items size byte > >> array added to my storage. > > > > May I ask what kind of dtypes and sizes you're working with? > > dtypes for images usually end up as floats - float32 or float64. On > disk, and when memory mapped, they are often int16 or uint16. Sizes > vary from fairly small 3D images of say 64 x 64 x 32 (1M in float64) > to rather large 4D images - say 256 x 256 x 50 x 500 at the very high > end (12.5G in float64). > OK, so the mask would be an extra 128KB or 1.6G, respectively. >> The reason I'm asking for more details about the implementation is > >> because that is most of the argument for array.mask at the moment (1 > >> and 2 above). > > > > I'm first trying to nail down more of the higher level requirements > before > > digging really deep into the implementation details. They greatly affect > how > > those details have to turn out. > > Once you've started with the array.mask framework, you've committed > yourself to the memory hit, and you may lose potential users who often > hit memory limits. My guess is that no-one currently using np.ma is > in that category, because it also uses a separate mask array, as I > understand it. > In the same way, if I start with the NA bit pattern framework, I've committed to throwing away the underlying values, and I will lose potential users who want to keep them. This tradeoff goes both ways, it looks like nobody would be completely satisfied with only one of the two approaches. -Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 16:14:39 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:14:39 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 6:29 AM, Pierre GM wrote: > This thread is getting quite long, innit ? > It's tiring, yeah! > And I think it's getting a tad confusing, because we're mixing two > different concepts: missing values and masks. > There should be support for missing values in numpy.core, I think we all > agree on that. > * What's been suggested of adding new dtypes (nafloat, naint) is great, by > why not making it the default, then ? > I don't like it as the default because of the lack of generality, but I've come up with an idea that's in the NEP where the masked approach would be the default, and a parameterized type would support the NA bit pattern idea in a general fashion. > * Operations involving a NA (whatever the NA actually is, depending on the > dtype of the input) should result in a NA (whatever the NA defined by the > outputs dtype). That could be done by overloading the existing ufuncs to > support the new dtypes. 
> I don't want to add hundreds of new inner loops for many different dtypes. I added the float16 dtype, and doing that another 24 times to produce NA-versions of the same times would be unpleasant to say the least. I think I've got a grasp on an approach that builds on top of the existing inner loops (or slightly refactored versions) to get all these ideas working together. > * There should be some simple methods to retrieve the location of those NAs > in an array. Whether we just output the indices or a full boolean array (w/ > True for a NA, False for a non-NA or vice-versa) needs to be decided. > Maybe we need two methods. np.isna or np.ismissing, and one of np.isvalid/np.isthere/np.isavail to get access to the mask the way people want without extra copies. > * We can always re-implement masked arrays to use these NAs in a way which > would be consistent with numpy.ma (so as not to confuse existing users of > numpy.ma): a mask would be a boolean array with the same shape than the > underlying ndarray, with True for NA. > That would be doable, yes. > Mark, I'd suggest you modify your proposal, making it clearer that it's not > to add all of numpy.ma functionalities in the core, but just support these > missing values. Using the term 'mask' should be avoided as much as possible, > use a 'missing data' or whatever. > If the implementation is in terms of a mask, I think the term 'mask' should still be used where it's relevant. Maybe there is no 'mask' or 'validitymask' attribute as I've proposed, and instead np.ismissing/np.isavail are the only interface for getting at the mask. I would still want arr.flags.hasmask and arr.flags.ownmask to be there. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 16:22:58 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:22:58 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 9:14 AM, Wes McKinney wrote: > On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris > wrote: > > > > > > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney > wrote: > >> > >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith > wrote: > >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root > wrote: > >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith > wrote: > >> >>> This is a situation where I would just... use an array and a mask, > >> >>> rather than a masked array. Then lots of things -- changing fill > >> >>> values, temporarily masking/unmasking things, etc. -- come from > free, > >> >>> just from knowing how arrays and boolean indexing work? > >> >> > >> >> With a masked array, it is "for free". Why re-invent the wheel? It > >> >> has > >> >> already been done for me. > >> > > >> > But it's not for free at all. It's an additional concept that has to > >> > be maintained, documented, and learned (with the last cost, which is > >> > multiplied by the number of users, being by far the greatest). It's > >> > not reinventing the wheel, it's saying hey, I have wheels and axles, > >> > but what I really need the library to provide is a wheel+axle > >> > assembly! > >> > >> You're communicating my argument better than I am. 
> >> > >> >>> Do we really get much advantage by building all these complex > >> >>> operations in? I worry that we're trying to anticipate and write > code > >> >>> for every situation that users find themselves in, instead of just > >> >>> giving them some simple, orthogonal tools. > >> >>> > >> >> > >> >> This is the danger, and which is why I advocate retaining the > >> >> MaskedArray > >> >> type that would provide the high-level "intelligent" operations, > >> >> meanwhile > >> >> having in the core the basic data structures for pairing a mask with > >> >> an > >> >> array, and to recognize a special np.NA value that would act upon the > >> >> mask > >> >> rather than the underlying data. Users would get very basic > >> >> functionality, > >> >> while the MaskedArray would continue to provide the interface that we > >> >> are > >> >> used to. > >> > > >> > The interface as described is quite different... in particular, all > >> > aggregate operations would change their behavior. > >> > > >> >>> As a corollary, I worry that learning and keeping track of how > masked > >> >>> arrays work is more hassle than just ignoring them and writing the > >> >>> necessary code by hand as needed. Certainly I can imagine that *if > the > >> >>> mask is a property of the data* then it's useful to have tools to > keep > >> >>> it aligned with the data through indexing and such. But some of > these > >> >>> other things are quicker to reimplement than to look up the docs > for, > >> >>> and the reimplementation is easier to read, at least for me... > >> >> > >> >> What you are advocating is similar to the "tried-n-true" coding > >> >> practice of > >> >> Matlab users of using NaNs. You will hear from Matlab programmers > >> >> about how > >> >> it is the greatest idea since sliced bread (and I was one of them). > >> >> Then I > >> >> was introduced to Numpy, and I while I do sometimes still do the NaN > >> >> approach, I realized that the masked array is a "better" way. > >> > > >> > Hey, no need to go around calling people Matlab programmers, you might > >> > hurt someone's feelings. > >> > > >> > But seriously, my argument is that every abstraction and new concept > >> > has a cost, and I'm dubious that the full masked array abstraction > >> > carries its weight and justifies this cost, because it's highly > >> > redundant with existing abstractions. That has nothing to do with how > >> > tried-and-true anything is. > >> > >> +1. I think I will personally only be happy if "masked array" can be > >> implemented while incurring near-zero cost from the end user > >> perspective. If what we end up with is a faster implementation of > >> numpy.ma in C I'm probably going to keep on using NaN... That's why > >> I'm entirely insistent that whatever design be dogfooded on non-expert > >> users. If it's very much harder / trickier / nuanced than R, you will > >> have failed. > >> > > > > This sounds unduly pessimistic to me. It's one thing to suggest different > > approaches, another to cry doom and threaten to go eat worms. And all > before > > the code is written, benchmarks run, or trial made of the usefulness of > the > > approach. Let us see how things look as they get worked out. Mark has a > good > > track record for innovative tools and I'm rather curious myself to see > what > > the result is. 
> > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > I hope you're right. So far it seems that anyone who has spent real > time with R (e.g. myself, Nathaniel) has expressed serious concerns > about the masked approach. And we got into this discussion at the Data > Array summit in Austin last month because we're trying to make Python > more competitive with R viz statistical and financial applications. > I'm just trying to be (R)ealistic =P Remember that I very earnestly am > doing everything I can these days to make scientific Python more > successful in finance and statistics. One big difference with R's > approach is that we care more about performance the the R community > does. So maybe having special NA values will be prohibitive for that > reason. > > Mark indeed has a fantastic track record and I've been extremely > impressed with his NumPy work, so I've no doubt he'll do a good job. I > just hope that you don't push aside my input-- my opinions are formed > entirely based on my domain experience. > I'm trying to understand and assimilate all the feedback I'm getting. I'm also responding to nearly all the emails here, but the ones I didn't respond to were still carefully read. The perspective you're providing about users who aren't primarily programmers is incredibly important, and I'm trying to at the same time satisfy another pet peeve of mine which is systems which are designed just for people who aren't primarily programmers and dumb things down to the level that it's unusable for building serious tools. Producing something which is simultaneously good for multiple classes of users takes a lot of design iterations, which this thread is helping with tremendously. -Mark > > -W > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 16:16:39 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:16:39 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <20110625120044.GA26847@phare.normalesup.org> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <20110625120044.GA26847@phare.normalesup.org> Message-ID: On Sat, Jun 25, 2011 at 7:00 AM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Sat, Jun 25, 2011 at 01:02:07AM +0100, Matthew Brett wrote: > > I'm personally worried that the memory overhead of array.masks will > > make many of us tend to avoid them. I work with images that can > > easily get large enough that I would not want an array-items size byte > > array added to my storage. > > I work with the same kind of data ( :D ). > > The way we manipulate our data, in my lab, is, to represent 3D data with > a mask to use a 1D array and a corresponding 3D mask. It reduces the > memory footprint, to the cost of loosing the 3D shape of the data. > > I am raising this because it is an option that has not been suggested so > far. I am not saying that it should be the option used to implement mask > arrays, I am just saying that there are many different ways of doing it, > and I don't think that there is a one-size-fits-all solution. 
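(A minimal sketch of the packed representation described just above, a flat array of the retained values alongside a full-shape boolean mask, using only plain NumPy boolean indexing; the shape and threshold are invented for illustration.)

import numpy as np

volume = np.random.rand(64, 64, 32)    # full 3D data set
mask = volume > 0.5                    # True where a voxel is kept

packed = volume[mask]                  # 1D array holding only the kept values

restored = np.zeros(volume.shape)      # rebuild the 3D shape when it is needed
restored[mask] = packed

This stores one float per kept voxel plus one bool per voxel, rather than a float for every voxel, at the cost of losing the 3D shape until it is unpacked.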
> > I tend to feel like Wes: good building block to easily implement our own > solutions is what we want. > This is why I'm also proposing to add a 'mask=' parameter to ufuncs, for example, to expose the implementation details of the masked array system to people who need masks but need them to be a bit different. There may be other places this kind of thing would slip in as well. -Mark > > My 2 cents, > > Gael > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Jun 25 16:25:36 2011 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 25 Jun 2011 22:25:36 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <20110625120044.GA26847@phare.normalesup.org> Message-ID: <20110625202536.GA31252@phare.normalesup.org> On Sat, Jun 25, 2011 at 03:16:39PM -0500, Mark Wiebe wrote: > This is why I'm also proposing to add a 'mask=' parameter to ufuncs, for > example, to expose the implementation details of the masked array system > to people who need masks but need them to be a bit different. There may be > other places this kind of thing would slip in as well. But I do not see how this is going to solve my use case, as I do not want to keep in memory the unmasked data, for memory usage reasons. :). Gael From mwwiebe at gmail.com Sat Jun 25 16:25:10 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:25:10 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 9:21 AM, Charles R Harris wrote: > On Sat, Jun 25, 2011 at 5:29 AM, Pierre GM wrote: > >> This thread is getting quite long, innit ? >> And I think it's getting a tad confusing, because we're mixing two >> different concepts: missing values and masks. >> There should be support for missing values in numpy.core, I think we all >> agree on that. >> * What's been suggested of adding new dtypes (nafloat, naint) is great, by >> why not making it the default, then ? >> > * Operations involving a NA (whatever the NA actually is, depending on the >> dtype of the input) should result in a NA (whatever the NA defined by the >> outputs dtype). That could be done by overloading the existing ufuncs to >> support the new dtypes. >> * There should be some simple methods to retrieve the location of those >> NAs in an array. Whether we just output the indices or a full boolean array >> (w/ True for a NA, False for a non-NA or vice-versa) needs to be decided. >> * We can always re-implement masked arrays to use these NAs in a way which >> would be consistent with numpy.ma (so as not to confuse existing users of >> numpy.ma): a mask would be a boolean array with the same shape than the >> underlying ndarray, with True for NA. >> Mark, I'd suggest you modify your proposal, making it clearer that it's >> not to add all of numpy.ma functionalities in the core, but just support >> these missing values. Using the term 'mask' should be avoided as much as >> possible, use a 'missing data' or whatever. >> > > I think he aims to support both. One complication with masks is keeping > them tied to the data on disk. 
With na values one file can contain both the > data and the missing data markers, whereas with masks, two files would be > required. I don't think that will fly in the long run unless there is some > standard file format, like geotiff for GIS, that combines both. > Before I was leaning mostly towards masks, but now that I've come up with an NA bit pattern approach that feels reasonable, I think implementing both together is on the table. Bringing up the file format issue is good, that hasn't been covered in the NEP yet. -Mark > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Sat Jun 25 16:35:54 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:35:54 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: On Sat, Jun 25, 2011 at 9:44 AM, Wes McKinney wrote: > On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris > wrote: > > On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney > wrote: > >> > >> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney > >> > wrote: > >> >> > >> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith > >> >> wrote: > >> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root > >> >> > wrote: > >> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith > >> >> >> wrote: > >> >> >>> This is a situation where I would just... use an array and a > mask, > >> >> >>> rather than a masked array. Then lots of things -- changing fill > >> >> >>> values, temporarily masking/unmasking things, etc. -- come from > >> >> >>> free, > >> >> >>> just from knowing how arrays and boolean indexing work? > >> >> >> > >> >> >> With a masked array, it is "for free". Why re-invent the wheel? > It > >> >> >> has > >> >> >> already been done for me. > >> >> > > >> >> > But it's not for free at all. It's an additional concept that has > to > >> >> > be maintained, documented, and learned (with the last cost, which > is > >> >> > multiplied by the number of users, being by far the greatest). It's > >> >> > not reinventing the wheel, it's saying hey, I have wheels and > axles, > >> >> > but what I really need the library to provide is a wheel+axle > >> >> > assembly! > >> >> > >> >> You're communicating my argument better than I am. > >> >> > >> >> >>> Do we really get much advantage by building all these complex > >> >> >>> operations in? I worry that we're trying to anticipate and write > >> >> >>> code > >> >> >>> for every situation that users find themselves in, instead of > just > >> >> >>> giving them some simple, orthogonal tools. > >> >> >>> > >> >> >> > >> >> >> This is the danger, and which is why I advocate retaining the > >> >> >> MaskedArray > >> >> >> type that would provide the high-level "intelligent" operations, > >> >> >> meanwhile > >> >> >> having in the core the basic data structures for pairing a mask > >> >> >> with > >> >> >> an > >> >> >> array, and to recognize a special np.NA value that would act upon > >> >> >> the > >> >> >> mask > >> >> >> rather than the underlying data. 
Users would get very basic > >> >> >> functionality, > >> >> >> while the MaskedArray would continue to provide the interface that > >> >> >> we > >> >> >> are > >> >> >> used to. > >> >> > > >> >> > The interface as described is quite different... in particular, all > >> >> > aggregate operations would change their behavior. > >> >> > > >> >> >>> As a corollary, I worry that learning and keeping track of how > >> >> >>> masked > >> >> >>> arrays work is more hassle than just ignoring them and writing > the > >> >> >>> necessary code by hand as needed. Certainly I can imagine that > *if > >> >> >>> the > >> >> >>> mask is a property of the data* then it's useful to have tools to > >> >> >>> keep > >> >> >>> it aligned with the data through indexing and such. But some of > >> >> >>> these > >> >> >>> other things are quicker to reimplement than to look up the docs > >> >> >>> for, > >> >> >>> and the reimplementation is easier to read, at least for me... > >> >> >> > >> >> >> What you are advocating is similar to the "tried-n-true" coding > >> >> >> practice of > >> >> >> Matlab users of using NaNs. You will hear from Matlab programmers > >> >> >> about how > >> >> >> it is the greatest idea since sliced bread (and I was one of > them). > >> >> >> Then I > >> >> >> was introduced to Numpy, and I while I do sometimes still do the > NaN > >> >> >> approach, I realized that the masked array is a "better" way. > >> >> > > >> >> > Hey, no need to go around calling people Matlab programmers, you > >> >> > might > >> >> > hurt someone's feelings. > >> >> > > >> >> > But seriously, my argument is that every abstraction and new > concept > >> >> > has a cost, and I'm dubious that the full masked array abstraction > >> >> > carries its weight and justifies this cost, because it's highly > >> >> > redundant with existing abstractions. That has nothing to do with > how > >> >> > tried-and-true anything is. > >> >> > >> >> +1. I think I will personally only be happy if "masked array" can be > >> >> implemented while incurring near-zero cost from the end user > >> >> perspective. If what we end up with is a faster implementation of > >> >> numpy.ma in C I'm probably going to keep on using NaN... That's why > >> >> I'm entirely insistent that whatever design be dogfooded on > non-expert > >> >> users. If it's very much harder / trickier / nuanced than R, you will > >> >> have failed. > >> >> > >> > > >> > This sounds unduly pessimistic to me. It's one thing to suggest > >> > different > >> > approaches, another to cry doom and threaten to go eat worms. And all > >> > before > >> > the code is written, benchmarks run, or trial made of the usefulness > of > >> > the > >> > approach. Let us see how things look as they get worked out. Mark has > a > >> > good > >> > track record for innovative tools and I'm rather curious myself to see > >> > what > >> > the result is. > >> > > >> > Chuck > >> > > >> > > >> > _______________________________________________ > >> > NumPy-Discussion mailing list > >> > NumPy-Discussion at scipy.org > >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > > >> > > >> > >> I hope you're right. So far it seems that anyone who has spent real > >> time with R (e.g. myself, Nathaniel) has expressed serious concerns > >> about the masked approach. And we got into this discussion at the Data > >> Array summit in Austin last month because we're trying to make Python > >> more competitive with R viz statistical and financial applications. 
> >> I'm just trying to be (R)ealistic =P Remember that I very earnestly am > >> doing everything I can these days to make scientific Python more > >> successful in finance and statistics. One big difference with R's > >> approach is that we care more about performance the the R community > >> does. So maybe having special NA values will be prohibitive for that > >> reason. > >> > >> Mark indeed has a fantastic track record and I've been extremely > >> impressed with his NumPy work, so I've no doubt he'll do a good job. I > >> just hope that you don't push aside my input-- my opinions are formed > >> entirely based on my domain experience. > >> > > > > I think what we really need to see are the use cases and work flow. The > ones > > that hadn't occurred to me before were memory mapped files and data > stored > > on disk in general. I think we may need some standard format for masked > data > > on disk if we don't go the NA value route. > > > > Chuck > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > > Here are some things I can think of that would be affected by any changes > here > > 1) Right now users of pandas can type pandas.isnull(series[5]) and > that will yield True if the value is NA for any dtype. This might be > hard to support in the masked regime > I think this would map to np.ismissing(series[5]). What you want probably depends on whether series[5] represents a single value, a struct dtype value, or is itself an array. > 2) Functions like {Series, DataFrame}.fillna would hopefully look just > like this: > > # value is 0 or some other value to fill > new_series = self.copy() > new_series[isnull(new_series)] = value > That should work fine, yes. Keep in mind that people will write custom NA handling logic. So they might > do: > > series[isnull(other_series) & isnull(other_series2)] = val > > 3) Nulling / NA-ing out data is very common > > # null out this data up to and including date1 in these three columns > frame.ix[:date1, [col1, col2, col3]] = NaN > With np.NA instead of NaN, I think it would give what you want. > > # But this should work fine too > frame.ix[:date1, [col1, col2, col3]] = 0 > Under the hood, this would be unmasking and setting the appropriate values. > I'll try to think of some others. The main thing is that the NA value > is very easy to think about and fits in naturally with how people (at > least statistical / financial users) think about and work with data. > If you have to say "I have to set these mask locations to True" it > introduces additional mental effort compared with "I'll just set these > values to NA" > This is exactly what I mean when I'm talking about implementation details versus interface choices. With enough use cases like you've given here, I'm hoping to get that interface right. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
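(A small sketch of how the fillna and nulling-out use cases above can be spelled with the tools that exist today: NaN for float data and numpy.ma for general dtypes. The np.NA and np.ismissing spellings from this thread appear only in comments, since they are proposals rather than existing NumPy API.)

import numpy as np

series = np.array([1.0, np.nan, 3.0, np.nan])   # NaN as the missing-value marker

# "fillna": replace missing entries with a value.
filled = series.copy()
filled[np.isnan(filled)] = 0.0       # proposed spelling: filled[np.ismissing(filled)] = 0.0

# Nulling out a slice of the data.
series[:2] = np.nan                  # proposed spelling: series[:2] = np.NA

# The numpy.ma spelling of the same fill operation, which also works for
# non-float dtypes where NaN is not available.
mseries = np.ma.masked_array([1, 2, 3, 4], mask=[False, True, False, True])
mfilled = mseries.filled(0)          # -> array([1, 0, 3, 0])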
URL: From mwwiebe at gmail.com Sat Jun 25 16:42:55 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Sat, 25 Jun 2011 15:42:55 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <20110625202536.GA31252@phare.normalesup.org> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <20110625120044.GA26847@phare.normalesup.org> <20110625202536.GA31252@phare.normalesup.org> Message-ID: On Sat, Jun 25, 2011 at 3:25 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Sat, Jun 25, 2011 at 03:16:39PM -0500, Mark Wiebe wrote: > > This is why I'm also proposing to add a 'mask=' parameter to ufuncs, > for > > example, to expose the implementation details of the masked array > system > > to people who need masks but need them to be a bit different. There > may be > > other places this kind of thing would slip in as well. > > But I do not see how this is going to solve my use case, as I do not want > to keep in memory the unmasked data, for memory usage reasons. :). > That problem requires an entirely different class of sparse storage data structures, something that could be layered on top of the NumPy arrays but I think is unfortunately out of scope for this discussion. -Mark > > Gael > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From efiring at hawaii.edu Sat Jun 25 17:42:52 2011 From: efiring at hawaii.edu (Eric Firing) Date: Sat, 25 Jun 2011 11:42:52 -1000 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: <4E062DF8.5010409@hawaii.edu> Message-ID: <4E06565C.9070802@hawaii.edu> On 06/25/2011 09:09 AM, Benjamin Root wrote: > > > On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith > wrote: > > On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing > wrote: > > On 06/25/2011 07:05 AM, Nathaniel Smith wrote: > >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew > Brett> wrote: > >>> To clarify, you're proposing for: > >>> > >>> a = np.sum(np.array([np.NA, np.NA]) > >>> > >>> 1) -> np.NA > >>> 2) -> 0.0 > >> > >> Yes -- and in R you get actually do get NA, while in numpy.ma > you > >> actually do get 0. I don't think this is a coincidence; I think it's > > > > No, you don't: > > > > In [2]: np.ma.array([2, 4], mask=[True, True]).sum() > > Out[2]: masked > > > > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True])) > > Out[4]: masked > > Huh. So in numpy.ma , sum([10, NA]) and sum([10]) > are the same, but > sum([NA]) and sum([]) are different? Sounds to me like you should file > a bug on numpy.ma... > > > Actually, no... I should have tested this before replying earlier: > > >>> a = np.ma.array([2, 4], mask=[True, True]) > >>> a > masked_array(data = [-- --], > mask = [ True True], > fill_value = 999999) > > >>> a.sum() > masked > >>> a = np.ma.array([], mask=[]) > >>> a > >>> a > masked_array(data = [], > mask = [], > fill_value = 1e+20) > >>> a.sum() > masked > > They are the same. > > > Anyway, the general point is that in R, NA's propagate, and in > numpy.ma , masked values are ignored (except, > apparently, if all values > are masked). Here, I actually checked these: > > Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4 > R: sum(c(NA, 4)) -> NA > > > If you want NaN behavior, then use NaNs. If you want masked behavior, > then use masks. 
But I think that where Mark is heading is towards infrastructure that makes it easy and efficient to do either, as needed, case by case, line by line, for any dtype--not just floats. If he can succeed, that helps all of us. This doesn't have to be "R versus masked arrays", or beginners versus experienced programmers. Eric > > Ben Root > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From wesmckinn at gmail.com Sat Jun 25 17:56:14 2011 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 25 Jun 2011 17:56:14 -0400 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: Message-ID: On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith wrote: > On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root wrote: >> On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith wrote: >>> I guess that is a difference, but I'm trying to get at something more >>> fundamental -- not just what operations are allowed, but what >>> operations people *expect* to be allowed. >> >> That is quite a trickier problem. > > It can be. I think of it as the difference between design and coding. > They overlap less than one might expect... > >>> Here's another possible difference -- in (1), intuitively, missingness >>> is a property of the data, so the logical place to put information >>> about whether you can expect missing values is in the dtype, and to >>> enable missing values you need to make a new array with a new dtype. >>> (If we use a mask-based implementation, then >>> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able >>> to skip making a copy of the data -- I'm talking ONLY about the >>> interface here, not whether missing data has a different storage >>> format from non-missing data.) >>> >>> In (2), the whole point is to use different masks with the same data, >>> so I'd argue masking should be a property of the array object rather >>> than the dtype, and the interface should logically allow masks to be >>> created, modified, and destroyed in place. >>> >> >> I can agree with this distinction.? However, if "missingness" is an >> intrinsic property of the data, then shouldn't users be implementing their >> own dtype tailored to the data they are using?? In other words, how far does >> the core of NumPy need to go to address this issue?? And how far would be >> "too much"? > > Yes, that's exactly my question: whether our goal is to implement > missingness in numpy or not! > >>> >>> They're both internally consistent, but I think we might have to make >>> a decision and stick to it. >>> >> >> Of course.? I think that Mark is having a very inspired idea of giving the R >> audience what they want (np.NA), while simultaneously making the use of >> masked arrays even easier (which I can certainly appreciate). > > I don't know. I think we could build a really top-notch implementation > of missingness. I also think we could build a really top-notch > implementation of masking. But my suggestions for how to improve the > current design are totally different depending on which of those is > the goal, and neither the R audience (like me) nor the masked array > audience (like you) seems really happy with the current design. And I > don't know what the goal is -- maybe it's something else and the > current design hits it perfectly? 
Maybe we want a top-notch > implementation of *both* missingness and masking, and those should be > two different things that can be combined, so that some of the > unmasked values inside a masked array can be NA? I don't know. > >> I will put out a little disclaimer.? I once had to use S+ for a class.? To >> be honest, it was the worst programming experience in my life.? This >> experience may be coloring my perception of R's approach to handling missing >> data. > > There's a lot of things that R does wrong (not their fault; language > design is an extremely difficult and specialized skill, that > statisticians are not exactly trained in), but it did make a few > excellent choices at the beginning. One was to steal the execution > model from Scheme, which, uh, isn't really relevant here. The other > was to steal the basic data types and standard library that the Bell > Labs statisticians had pounded into shape over many years. I use > Python now because using R for everything would drive me crazy, but > despite its many flaws, it still does some things so well that it's > become *the* language used for basically all statistical research. I'm > only talking about stealing those things :-). > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > +1. Everyone knows R ain't perfect. I think it's an atrociously bad programming language but it can be unbelievably good at statistics, as evidenced by its success. Brings to mind Andy Gelman's blog last fall: http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html As someone in a statistics department I've frequently been disheartened when I see how easy many statistical things are in R and how much more difficult they are in Python. This is partially the result of poor interfaces for statistical modeling, partially due to data structures (e.g. the integrated-ness of data.frame throughout R) and things like handling of missing data of which there's currently no equivalent. - Wes From chaoyuejoy at gmail.com Sun Jun 26 14:48:26 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Sun, 26 Jun 2011 20:48:26 +0200 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt Message-ID: Dear all numpy users, I want to read a csv file with many (49) columns, the first column is string and remaning can be float. how can I avoid type in like data=numpy.genfromtxt('data.csv',delimiter=';',names=True, dtype=(S10, float, float, ......)) Can I just specify the type of first cloumn is tring and the remaing float? how can I do that? Thanks a lot, Chao -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... 
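(A compact sketch of the dtype construction being asked about, along the lines of the recipes given in the replies below. The file name is a placeholder, and, as reported later in the thread, listing usecols explicitly may be needed for the list-of-types dtype to be accepted.)

import numpy as np

# One 10-character string column followed by 48 float columns,
# built programmatically instead of typing 49 entries by hand.
col_types = ['S10'] + [float] * 48

data = np.genfromtxt('data.csv', delimiter=';', names=True,
                     usecols=tuple(range(49)), dtype=col_types)

# Alternatively, let genfromtxt guess each column's type; note that the
# default dtype is float, which turns the string column into nan.
data2 = np.genfromtxt('data.csv', delimiter=';', names=True, dtype=None)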
URL: From derek at astro.physik.uni-goettingen.de Sun Jun 26 15:27:59 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Sun, 26 Jun 2011 21:27:59 +0200 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt In-Reply-To: References: Message-ID: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> On 26.06.2011, at 8:48PM, Chao YUE wrote: > I want to read a csv file with many (49) columns, the first column is string and remaning can be float. > how can I avoid type in like > > data=numpy.genfromtxt('data.csv',delimiter=';',names=True, dtype=(S10, float, float, ......)) > > Can I just specify the type of first cloumn is tring and the remaing float? how can I do that? Simply use 'dtype=None' to let genfromtxt automatically determine the type (it is perhaps a bit confusing that this is not the default - maybe it should be repeated in the docstring for clarity that the default is for dtype is 'float'...). Also, a shorter way of typing the dtype above (e.g. in case some columns would be auto-detected as int) would be ['S10'] + [ float for n in range(48) ] HTH, Derek From ben.root at ou.edu Sun Jun 26 15:56:46 2011 From: ben.root at ou.edu (Benjamin Root) Date: Sun, 26 Jun 2011 14:56:46 -0500 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt In-Reply-To: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> References: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> Message-ID: On Sun, Jun 26, 2011 at 2:27 PM, Derek Homeier < derek at astro.physik.uni-goettingen.de> wrote: > On 26.06.2011, at 8:48PM, Chao YUE wrote: > > > I want to read a csv file with many (49) columns, the first column is > string and remaning can be float. > > how can I avoid type in like > > > > data=numpy.genfromtxt('data.csv',delimiter=';',names=True, dtype=(S10, > float, float, ......)) > > > > Can I just specify the type of first cloumn is tring and the remaing > float? how can I do that? > > Simply use 'dtype=None' to let genfromtxt automatically determine the type > (it is perhaps a bit confusing that this is not the default - maybe it > should be repeated in the docstring for clarity that the default is for > dtype is 'float'...). > Also, a shorter way of typing the dtype above (e.g. in case some columns > would be auto-detected as int) would be > ['S10'] + [ float for n in range(48) ] > > Another possibility is -- if you don't want the first column string data -- is to basically ignore that column by specifying the "usecol" parameter. Ben Root -------------- next part -------------- An HTML attachment was scrubbed... URL: From chaoyuejoy at gmail.com Sun Jun 26 16:28:56 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Sun, 26 Jun 2011 22:28:56 +0200 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt In-Reply-To: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> References: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> Message-ID: *Hi Derek, Thanks very much for your quick reply. I make a short summary of what I've tried. Actually the *['S10'] + [ float for n in range(48) ] *only* *works when you explicitly specify the columns to be read, and genfromtxt cannot automatically determine the type* *if you don't specify the type.... I also have a problem with the missing value which I described at the end of this mail. Sorry for the very long example.... 
Thanks again, * In [164]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=tuple(range(49)),dtype=['S10'] + [ float for n in range(48)]) In [165]: b Out[165]: array([ ('01/01/2003', -999.0, -1.028, -999.0, -999.0, -999.0, -999.0, -999.0, - 25.368400000000001, 0.75920799999999999, -25.425699999999999, 0.7763219999999999 6, -25.220500000000001, 0.77561899999999995, 0.20000000000000001, 280.089, 0.574 58299999999995, 0.417018, -0.042441800000000002, 0.0428254, -0.18517600000000001 , -0.056775800000000001, 93.721299999999999, -8.1318099999999998, -9.5244, -9.93 23200000000007, -10.2728, -20.945499999999999, -8.4939999999999998, -9.567819999 9999993, -9.9175500000000003, -9.7835400000000003, -10.4445, -999.0, -999.0, -99 9.0, -999.0, -999.0, -2.80863, -6.7711100000000002, -999.0, -999.0, -999.0, 0.10 9, 0.075999999999999998, 0.10000000000000001, 0.074999999999999997, 0.0, -999.0), ('01/01/2003', -999.0, -0.40899999999999997, -999.0, -999.0, -999.0, -999 .0, -999.0, -25.3233, 0.75929800000000003, -25.368600000000001, 0.77451599999999 998, -25.118400000000001, 0.77264200000000005, 0.20499999999999999, 267.80599999 999998, 0.59291700000000003, 0.42051699999999997, -0.037141399999999998, 0.04043 3200000000002, -0.16375999999999999, -0.029456400000000001, 93.749099999999999, -8.1292799999999996, -9.5213800000000006, -9.9336199999999995, -10.2749000000000 01, -21.1402, -8.4918899999999997, -9.5663699999999992, -9.9207000000000001, -9. 7896099999999997, -10.4514, -999.0, -999.0, -999.0, -999.0, -999.0, -2.8468, -6. 7986899999999997, -999.0, -999.0, -999.0, 0.109, 0.075999999999999998, 0.1000000 0000000001, 0.074999999999999997, 0.0, -999.0), .... dtype=[('TIMESTAMP', '|S10'), ('CO2_flux', ' in () C:\Python26\lib\site-packages\numpy\lib\npyio.pyc in genfromtxt(fname, dtype, co mments, delimiter, skiprows, skip_header, skip_footer, converters, missing, miss ing_values, filling_values, usecols, names, excludelist, deletechars, replace_sp ace, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_rais e) 1449 # Raise an exception ? 1450 if invalid_raise: -> 1451 raise ValueError(errmsg) 1452 # Issue a warning ? 1453 else: ValueError * If I don't specify the dtype, it will not recognize the type of the first column (it displays as nan):* In [172]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=(0,1,2)) In [173]: b Out[173]: array([(nan, -999.0, -1.028), (nan, -999.0, -0.40899999999999997), (nan, -999.0, 0.16700000000000001), ..., (nan, -999.0, -999.0), (nan, -999.0, -999.0), (nan, -999.0, -999.0)], dtype=[('TIMESTAMP', ' > On 26.06.2011, at 8:48PM, Chao YUE wrote: > > > I want to read a csv file with many (49) columns, the first column is > string and remaning can be float. > > how can I avoid type in like > > > > data=numpy.genfromtxt('data.csv',delimiter=';',names=True, dtype=(S10, > float, float, ......)) > > > > Can I just specify the type of first cloumn is tring and the remaing > float? how can I do that? > > Simply use 'dtype=None' to let genfromtxt automatically determine the type > (it is perhaps a bit confusing that this is not the default - maybe it > should be repeated in the docstring for clarity that the default is for > dtype is 'float'...). > Also, a shorter way of typing the dtype above (e.g. 
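(For reference, a minimal sketch of writing and reading a NetCDF 3 file with the scipy.io netcdf interface mentioned above, which needs neither libhdf5 nor the Unidata netCDF C library. The variable names and sizes are placeholders.)

import numpy as np
from scipy.io import netcdf

# Write a small NetCDF 3 file.
f = netcdf.netcdf_file('simple.nc', 'w')
f.createDimension('time', 10)
time = f.createVariable('time', 'i', ('time',))
time[:] = np.arange(10)
time.units = 'days since 2011-01-01'
f.close()

# Read it back.
f = netcdf.netcdf_file('simple.nc', 'r')
print(f.variables['time'][:])
f.close()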
in case some columns > would be auto-detected as int) would be > ['S10'] + [ float for n in range(48) ] > > HTH, > Derek > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... URL: From oliphant at enthought.com Sun Jun 26 01:25:40 2011 From: oliphant at enthought.com (Travis Oliphant) Date: Sun, 26 Jun 2011 00:25:40 -0500 Subject: [Numpy-discussion] Concepts for masked/missing data In-Reply-To: References: <4E062DF8.5010409@hawaii.edu> Message-ID: <04F8CFF2-9798-40B2-80FA-A6CA31F96CCD@enthought.com> I haven't commented yet on the mailing list because of time pressures although I have spoken to Mark as often as I can --- and have encouraged him to pursue his ideas and discuss them with the community. The Numeric Python discussion list has a long history of great dialogue to try and bring out as many perspectives as possible as we wrestle with improving the code base. It is very encouraging to see that tradition continuing. Because Enthought was mentioned earlier in this thread, I would like to try and clarify a few things about my employer and the company's interest. Enthought has been very interested in the development of the NumPy and SciPy stack (and the broader SciPy community) for sometime. With its limited resources, Enthought helped significantly to form the SciPy community and continues to sponsor it as much as it can. Many developers who work at Enthought (including me) have personal interest in the NumPy / SciPy community and codebase that go beyond Enthought's ability to invest directly as well. While Enthought has limited resources to invest directly in pursuing the goal, Enthought is very interested in improving Python's use as a data analysis environment. Because of that interest, Enthought sponsored a "data-array" summit in May. There is an inScight podcast that summarizes some of the event that you can listen to at http://inscight.org/2011/05/18/episode_13/. The purpose of this event was to bring a few people together who have been working on different aspects of the problem (particularly around the labelled array, or data array problem). We also wanted to jump start the activity of our interns and make sure that some of the use cases we have seen during the past several years while working on client projects had some light. The event was successful in that it generated *a lot* of ideas. Some of these ideas were summarized in notes that are linked to at this convore thread: https://convore.com/python-scientific-computing/data-array-in-numpy/ One of the major ideas that emerged during the discussion is that NumPy needs to be able to handle missing data in a more integrated way (i.e. there need to be functions that do the "right" thing in the face of missing data). One approach that was suggested during some of the discussion was that one way to handle missing data would be to introducing special nadtypes. 
Mark is one of 2 interns that we have this summer who are tasked at a high level with taking what was learned at the summit and implementing critical pieces as their skills and interests allow. I have been talking with them individually to map out specific work targets for the summer. Christopher Jordan-Squires is one of our interns who is studying to get a PhD in Mathematics at the Univ. of Washington. He has a strong interest in statistics and a desire to make Python as easy to use as R for certain statistical work flows. Mark Wiebe is known on this list because of his recent success at working on the NumPy code base. As a result of that success, Mark is working on making improvements to NumPy that are seen as most critical to solving some of the same problems we keep seeing in our projects (labeled arrays being one of them). We are also very interested in the Pandas project as it brings a data-structure like the successful DataFrame in R to the Python space (and it helps solve some of the problems our clients are seeing). It would be good to make sure that core functionality that Pandas needs is available in NumPy where appropriate. The date-time work that Mark did was the first "low-hanging" fruit that needed to be finished. The second project that Mark is involved with is creating an approach for missing data in NumPy. I suggested the missing data dtypes (in part because Mark had expressed some concerns about the way dtypes are handled in NumPy and I would love for that mechanism for user-defined data-types and the whole data-type infrastructure to be improved as needed.) Mark spent some time thinking about it and felt more comfortable with the masked array solution and that is where we are now. Enthought's main interest remains in seeing how much of the data array can and should be moved into low-level NumPy as well as the implementation of functionality (wherever it may live) to make data analysis easier and more productive in Python. Again, though, this is something Enthought as a company can only invest limited resources in, and we want to make sure that Mark spends the time that we are sponsoring doing work that is seen as valuable by the community but more importantly matching our own internal needs. I will post a follow-on message that provides my current views on the subject of missing data and masked arrays. -Travis On Jun 25, 2011, at 2:09 PM, Benjamin Root wrote: > > > On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith wrote: > On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing wrote: > > On 06/25/2011 07:05 AM, Nathaniel Smith wrote: > >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett wrote: > >>> To clarify, you're proposing for: > >>> > >>> a = np.sum(np.array([np.NA, np.NA]) > >>> > >>> 1) -> np.NA > >>> 2) -> 0.0 > >> > >> Yes -- and in R you get actually do get NA, while in numpy.ma you > >> actually do get 0. I don't think this is a coincidence; I think it's > > > > No, you don't: > > > > In [2]: np.ma.array([2, 4], mask=[True, True]).sum() > > Out[2]: masked > > > > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True])) > > Out[4]: masked > > Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but > sum([NA]) and sum([]) are different? Sounds to me like you should file > a bug on numpy.ma... > > Actually, no... 
I should have tested this before replying earlier: > > >>> a = np.ma.array([2, 4], mask=[True, True]) > >>> a > masked_array(data = [-- --], > mask = [ True True], > fill_value = 999999) > > >>> a.sum() > masked > >>> a = np.ma.array([], mask=[]) > >>> a > >>> a > masked_array(data = [], > mask = [], > fill_value = 1e+20) > >>> a.sum() > masked > > They are the same. > > > Anyway, the general point is that in R, NA's propagate, and in > numpy.ma, masked values are ignored (except, apparently, if all values > are masked). Here, I actually checked these: > > Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4 > R: sum(c(NA, 4)) -> NA > > > If you want NaN behavior, then use NaNs. If you want masked behavior, then use masks. > > Ben Root > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion --- Travis Oliphant Enthought, Inc. oliphant at enthought.com 1-512-536-1057 http://www.enthought.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 27 11:55:45 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 27 Jun 2011 10:55:45 -0500 Subject: [Numpy-discussion] missing data discussion round 2 Message-ID: First I'd like to thank everyone for all the feedback you're providing, clearly this is an important topic to many people, and the discussion has helped clarify the ideas for me. I've renamed and updated the NEP, then placed it into the master NumPy repository so it has a more permanent home here: https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst In the NEP, I've tried to address everything that was raised in the original thread and in Nathaniel's followup 'Concepts' thread. To deal with the issue of whether a mask is True or False for a missing value, I've removed the 'mask' attribute entirely, except for ufunc-like functions np.ismissing and np.isavail which return the two styles of masks. Here's a high level summary of how I'm thinking of the topic, and what I will implement: *Missing Data Abstraction* There appear to be two useful ways to think about missing data that are worth supporting. 1) Unknown yet existing data 2) Data that doesn't exist In 1), an NA value causes outputs to become NA except in a small number of exceptions such as boolean logic, and in 2), operations treat the data as if there were a smaller array without the NA values. *Temporarily Ignoring Data* * * In some cases, it is useful to flag data as NA temporarily, possibly in several different ways, for particular calculations or testing out different ways of throwing away outliers. This is independent of the missing data abstraction, still requiring a choice of 1) or 2) above. *Implementation Techniques* * * There are two mechanisms generally used to implement missing data abstractions, * * 1) An NA bit pattern 2) A mask I've described a design in the NEP which can include both techniques using the same interface. The mask approach is strictly more general than the NA bit pattern approach, except for a few things like the idea of supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP. My intention is to implement the mask-based design, and possibly also implement the NA bit pattern design, but if anything gets cut it will be the NA bit patterns. Thanks again for all your input so far, and thanks in advance for your suggestions for improving this new revision of the NEP. 
-Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Mon Jun 27 12:17:45 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Mon, 27 Jun 2011 18:17:45 +0200 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: <4E00E455.2070604@noaa.gov> References: <4E00E455.2070604@noaa.gov> Message-ID: On 21.06.2011, at 8:35PM, Christopher Barker wrote: > Robert Kern wrote: >> https://raw.github.com/numpy/numpy/master/doc/neps/npy-format.txt > > Just a note. From that doc: > > """ > HDF5 is a complicated format that more or less implements > a hierarchical filesystem-in-a-file. This fact makes satisfying > some of the Requirements difficult. To the author's knowledge, as > of this writing, there is no application or library that reads or > writes even a subset of HDF5 files that does not use the canonical > libhdf5 implementation. > """ > > I'm pretty sure that the NetcdfJava libs, developed by Unidata, use > their own home-grown code. netcdf4 is built on HDF5, so that qualifies > as "a library that reads or writes a subset of HDF5 files". Perhaps > there are lessons to be learned there. (too bad it's Java) > > """ > Furthermore, by > providing the first non-libhdf5 implementation of HDF5, we would > be able to encourage more adoption of simple HDF5 in applications > where it was previously infeasible because of the size of the > library. > """ > > I suppose this point is still true -- a C lib that supported a subset of > hdf would be nice. > > That being said, I like the simplicity of the .npy format, and I don't > know that anyone wants to take any of this on anyway. Some late comments on the note (I was a bit surprised that HDF5 installation seems to be a serious hurdle to many - maybe I've just been profiting from the fink build system for OS X here - but I also was not aware that the current netCDF is built on downwards-compatibility to the HDF5 standard, something useful learnt again...:-) Some more confusion arose when finding that the NCAR netCDF includes C and Fortran versions: http://www.unidata.ucar.edu/software/netcdf/ but they also depend actually on HDF5 for netCDF 4 access. While the Java version appears not to, it also only provides *read* access to those formats, so it probably would not be of that much help anyway. The netCDF4-Python package mentioned before http://code.google.com/p/netcdf4-python/ unfortunately builds on HDF5 again, same for the PyNIO module http://www.pyngl.ucar.edu/Nio.shtml which is probably explained by the above dependencies. Finally, the former Scientific.IO NetCDF interface is now part of scipy.io, but I assume it only supports netCDF 3 (the documentation is not specific about that). This might be the easiest option for a portable data format (if Matlab supports it). Cheers, Derek From robert.kern at gmail.com Mon Jun 27 12:33:39 2011 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 27 Jun 2011 11:33:39 -0500 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: <4E00E455.2070604@noaa.gov> References: <4E00E455.2070604@noaa.gov> Message-ID: On Tue, Jun 21, 2011 at 13:35, Christopher Barker wrote: > Robert Kern wrote: >> https://raw.github.com/numpy/numpy/master/doc/neps/npy-format.txt > > Just a note. From that doc: > > """ > ? ? HDF5 is a complicated format that more or less implements > ? ? a hierarchical filesystem-in-a-file. ?This fact makes satisfying > ? ? some of the Requirements difficult. ?To the author's knowledge, as > ? ? 
of this writing, there is no application or library that reads or > ? ? writes even a subset of HDF5 files that does not use the canonical > ? ? libhdf5 implementation. > """ > > I'm pretty sure that the NetcdfJava libs, developed by Unidata, use > their own home-grown code. netcdf4 is built on HDF5, so that qualifies > as "a library that reads or writes a subset of HDF5 files". Perhaps > there are lessons to be learned there. (too bad it's Java) Good to know! The NEP predates the release of NetcdfJava 4.0, the first to include the HDF5 support, by about 6 months, so that's why it says what it does. :-) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From robert.kern at gmail.com Mon Jun 27 12:36:53 2011 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 27 Jun 2011 11:36:53 -0500 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: <4E00E455.2070604@noaa.gov> Message-ID: On Mon, Jun 27, 2011 at 11:17, Derek Homeier wrote: > On 21.06.2011, at 8:35PM, Christopher Barker wrote: > >> Robert Kern wrote: >>> https://raw.github.com/numpy/numpy/master/doc/neps/npy-format.txt >> >> Just a note. From that doc: >> >> """ >> ? ? HDF5 is a complicated format that more or less implements >> ? ? a hierarchical filesystem-in-a-file. ?This fact makes satisfying >> ? ? some of the Requirements difficult. ?To the author's knowledge, as >> ? ? of this writing, there is no application or library that reads or >> ? ? writes even a subset of HDF5 files that does not use the canonical >> ? ? libhdf5 implementation. >> """ >> >> I'm pretty sure that the NetcdfJava libs, developed by Unidata, use >> their own home-grown code. netcdf4 is built on HDF5, so that qualifies >> as "a library that reads or writes a subset of HDF5 files". Perhaps >> there are lessons to be learned there. (too bad it's Java) > Some late comments on the note (I was a bit surprised that HDF5 installation seems to be a serious hurdle to many - maybe I've just been profiting from the fink build system for OS X here - but I also was not aware that the current netCDF is built on downwards-compatibility to the HDF5 standard, something useful learnt again...:-) It's not so much that it's hard to build for lots of people. Rather, it would be quite difficult to include into numpy itself, particularly if we are just relying on distutils. numpy is too "fundamental" of a package to have extra dependencies. > Some more confusion arose when finding that the NCAR netCDF includes C and Fortran versions: > http://www.unidata.ucar.edu/software/netcdf/ > but they also depend actually on HDF5 for netCDF 4 access. While the Java version appears not to, it also only provides *read* access to those formats, so it probably would not be of that much help anyway. Also good to know! > The netCDF4-Python package mentioned before > http://code.google.com/p/netcdf4-python/ > unfortunately builds on HDF5 again, same for the PyNIO module > http://www.pyngl.ucar.edu/Nio.shtml > which is probably explained by the above dependencies. > > Finally, the former Scientific.IO NetCDF interface is now part of scipy.io, but I assume it only supports netCDF 3 (the documentation is not specific about that). This might be the easiest option for a portable data format (if Matlab supports it). Yes, it is NetCDF 3. 
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." ? -- Umberto Eco From derek at astro.physik.uni-goettingen.de Mon Jun 27 12:52:11 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Mon, 27 Jun 2011 18:52:11 +0200 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: References: <4E00E455.2070604@noaa.gov> Message-ID: <2F6744DC-6B51-469A-87CD-C09409D5C998@astro.physik.uni-goettingen.de> On 27.06.2011, at 6:36PM, Robert Kern wrote: >> Some late comments on the note (I was a bit surprised that HDF5 installation seems to be a serious hurdle to many - maybe I've just been profiting from the fink build system for OS X here - but I also was not aware that the current netCDF is built on downwards-compatibility to the HDF5 standard, something useful learnt again...:-) > > It's not so much that it's hard to build for lots of people. Rather, > it would be quite difficult to include into numpy itself, particularly > if we are just relying on distutils. numpy is too "fundamental" of a > package to have extra dependencies. I completely agree with that - I doubt that there is a standard data format that is both in wide use and simple enough to be easily included with numpy. Pyfits might be possible, but that again addresses probably a too specific user group. Cheers, Derek From charlesr.harris at gmail.com Mon Jun 27 12:53:38 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 27 Jun 2011 10:53:38 -0600 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 9:55 AM, Mark Wiebe wrote: > First I'd like to thank everyone for all the feedback you're providing, > clearly this is an important topic to many people, and the discussion has > helped clarify the ideas for me. I've renamed and updated the NEP, then > placed it into the master NumPy repository so it has a more permanent home > here: > > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > > In the NEP, I've tried to address everything that was raised in the > original thread and in Nathaniel's followup 'Concepts' thread. To deal with > the issue of whether a mask is True or False for a missing value, I've > removed the 'mask' attribute entirely, except for ufunc-like functions > np.ismissing and np.isavail which return the two styles of masks. Here's a > high level summary of how I'm thinking of the topic, and what I will > implement: > > *Missing Data Abstraction* > > There appear to be two useful ways to think about missing data that are > worth supporting. > > 1) Unknown yet existing data > 2) Data that doesn't exist > > In 1), an NA value causes outputs to become NA except in a small number of > exceptions such as boolean logic, and in 2), operations treat the data as if > there were a smaller array without the NA values. > > *Temporarily Ignoring Data* > * > * > In some cases, it is useful to flag data as NA temporarily, possibly in > several different ways, for particular calculations or testing out different > ways of throwing away outliers. This is independent of the missing data > abstraction, still requiring a choice of 1) or 2) above. 
> > *Implementation Techniques* > * > * > There are two mechanisms generally used to implement missing data > abstractions, > * > * > 1) An NA bit pattern > 2) A mask > > I've described a design in the NEP which can include both techniques using > the same interface. The mask approach is strictly more general than the NA > bit pattern approach, except for a few things like the idea of supporting > the dtype 'NA[f8,InfNan]' which you can read about in the NEP. > > My intention is to implement the mask-based design, and possibly also > implement the NA bit pattern design, but if anything gets cut it will be the > NA bit patterns. > > I have the impression that the mask-based design would be easier. Perhaps you could do that one first and folks could try out the API and see how they like it and discover whether the memory overhead is a problem in practice. Some discussion of disk storage might also help. I don't see how the rules can be enforced if two files are used, one for the mask and another for the data, but that may just be something we need to live with. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.becker at amolf.nl Mon Jun 27 13:11:36 2011 From: n.becker at amolf.nl (Nils Becker) Date: Mon, 27 Jun 2011 19:11:36 +0200 Subject: [Numpy-discussion] fast numpy i/o Message-ID: <4E08B9C8.8030905@amolf.nl> Hi, >> Finally, the former Scientific.IO NetCDF interface is now part of >> scipy.io, but I assume it only supports netCDF 3 (the documentation >> is not specific about that). This might be the easiest option for a >> portable data format (if Matlab supports it). > Yes, it is NetCDF 3. In recent versions of Scientific, NetCDF 4 can be written and read, see http://dirac.cnrs-orleans.fr/ScientificPython/ScientificPythonManual/Scientific.IO.NetCDF.NetCDFFile-class.html N. From matthew.brett at gmail.com Mon Jun 27 13:18:57 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Mon, 27 Jun 2011 18:18:57 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Mon, Jun 27, 2011 at 5:53 PM, Charles R Harris wrote: > > > On Mon, Jun 27, 2011 at 9:55 AM, Mark Wiebe wrote: >> >> First I'd like to thank everyone for all the feedback you're providing, >> clearly this is an important topic to many people, and the discussion has >> helped clarify the ideas for me. I've renamed and updated the NEP, then >> placed it into the master NumPy repository so it has a more permanent home >> here: >> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst >> In the NEP, I've tried to address everything that was raised in the >> original thread and in Nathaniel's followup 'Concepts' thread. To deal with >> the issue of whether a mask is True or False for a missing value, I've >> removed the 'mask' attribute entirely, except for ufunc-like functions >> np.ismissing and np.isavail which return the two styles of masks. Here's a >> high level summary of how I'm thinking of the topic, and what I will >> implement: >> Missing Data Abstraction >> There appear to be two useful ways to think about missing data that are >> worth supporting. >> 1) Unknown yet existing data >> 2) Data that doesn't exist >> In 1), an NA value causes outputs to become NA except in a small number of >> exceptions such as boolean logic, and in 2), operations treat the data as if >> there were a smaller array without the NA values. 
>> Temporarily Ignoring Data >> In some cases, it is useful to flag data as NA temporarily, possibly in >> several different ways, for particular calculations or testing out different >> ways of throwing away outliers. This is independent of the missing data >> abstraction, still requiring a choice of 1) or 2) above. >> Implementation Techniques >> There are two mechanisms generally used to implement missing data >> abstractions, >> 1) An NA bit pattern >> 2) A mask >> I've described a design in the NEP which can include both techniques using >> the same interface. The mask approach is strictly more general than the NA >> bit pattern approach, except for a few things like the idea of supporting >> the dtype 'NA[f8,InfNan]' which you can read about in the NEP. >> My intention is to implement the mask-based design, and possibly also >> implement the NA bit pattern design, but if anything gets cut it will be the >> NA bit patterns. > > I have the impression that the mask-based design would be easier. Perhaps > you could do that one first and folks could try out the API and see how they > like it and discover whether the memory overhead is a problem in practice. That seems like a risky strategy to me, as the most likely outcome is that people worried about memory will avoid masked arrays because they know they use more memory. The memory usage is predictable and we won't learn any more about it from use. We most of us already know if we're having to optimize code for memory. You won't get complaints, you'll just lose a group of users, who will, I suspect, stick to NaNs, unsatisfactory as they are. See you, Matthew From e.antero.tammi at gmail.com Mon Jun 27 13:32:52 2011 From: e.antero.tammi at gmail.com (eat) Date: Mon, 27 Jun 2011 20:32:52 +0300 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 8:18 PM, Matthew Brett wrote: > Hi, > > On Mon, Jun 27, 2011 at 5:53 PM, Charles R Harris > wrote: > > > > > > On Mon, Jun 27, 2011 at 9:55 AM, Mark Wiebe wrote: > >> > >> First I'd like to thank everyone for all the feedback you're providing, > >> clearly this is an important topic to many people, and the discussion > has > >> helped clarify the ideas for me. I've renamed and updated the NEP, then > >> placed it into the master NumPy repository so it has a more permanent > home > >> here: > >> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > >> In the NEP, I've tried to address everything that was raised in the > >> original thread and in Nathaniel's followup 'Concepts' thread. To deal > with > >> the issue of whether a mask is True or False for a missing value, I've > >> removed the 'mask' attribute entirely, except for ufunc-like functions > >> np.ismissing and np.isavail which return the two styles of masks. Here's > a > >> high level summary of how I'm thinking of the topic, and what I will > >> implement: > >> Missing Data Abstraction > >> There appear to be two useful ways to think about missing data that are > >> worth supporting. > >> 1) Unknown yet existing data > >> 2) Data that doesn't exist > >> In 1), an NA value causes outputs to become NA except in a small number > of > >> exceptions such as boolean logic, and in 2), operations treat the data > as if > >> there were a smaller array without the NA values. 
> >> Temporarily Ignoring Data > >> In some cases, it is useful to flag data as NA temporarily, possibly in > >> several different ways, for particular calculations or testing out > different > >> ways of throwing away outliers. This is independent of the missing data > >> abstraction, still requiring a choice of 1) or 2) above. > >> Implementation Techniques > >> There are two mechanisms generally used to implement missing data > >> abstractions, > >> 1) An NA bit pattern > >> 2) A mask > >> I've described a design in the NEP which can include both techniques > using > >> the same interface. The mask approach is strictly more general than the > NA > >> bit pattern approach, except for a few things like the idea of > supporting > >> the dtype 'NA[f8,InfNan]' which you can read about in the NEP. > >> My intention is to implement the mask-based design, and possibly also > >> implement the NA bit pattern design, but if anything gets cut it will be > the > >> NA bit patterns. > > > > I have the impression that the mask-based design would be easier. Perhaps > > you could do that one first and folks could try out the API and see how > they > > like it and discover whether the memory overhead is a problem in > practice. > > That seems like a risky strategy to me, as the most likely outcome is > that people worried about memory will avoid masked arrays because they > know they use more memory. The memory usage is predictable and we > won't learn any more about it from use. We most of us already know if > we're having to optimize code for memory. > > You won't get complaints, you'll just lose a group of users, who will, > I suspect, stick to NaNs, unsatisfactory as they are. > +1 - eat > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From derek at astro.physik.uni-goettingen.de Mon Jun 27 13:34:59 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Mon, 27 Jun 2011 19:34:59 +0200 Subject: [Numpy-discussion] fast numpy i/o In-Reply-To: <4E08B9C8.8030905@amolf.nl> References: <4E08B9C8.8030905@amolf.nl> Message-ID: On 27.06.2011, at 7:11PM, Nils Becker wrote: >>> Finally, the former Scientific.IO NetCDF interface is now part of >>> scipy.io, but I assume it only supports netCDF 3 (the documentation >>> is not specific about that). This might be the easiest option for a >>> portable data format (if Matlab supports it). > >> Yes, it is NetCDF 3. > > In recent versions of Scientific, NetCDF 4 can be written and read, see > > http://dirac.cnrs-orleans.fr/ScientificPython/ScientificPythonManual/Scientific.IO.NetCDF.NetCDFFile-class.html Quite remarkable, since from the Scientific home page it was not obvious the package had been developed at all in the last 4 years or even made the transition from Numeric! But apparently 2.9.x is doing so; however it also requires the Unidata netCDF library and everything that comes with that. 
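For the netCDF 3 route mentioned above, the reader bundled with scipy.io is enough by itself; a minimal sketch, where 'example.nc' and the variable name 'temperature' are only placeholders:

>>> from scipy.io import netcdf
>>> f = netcdf.netcdf_file('example.nc', 'r')    # classic netCDF 3 file
>>> temp = f.variables['temperature'][:]         # plain numpy array of the data
>>> f.close()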
Cheers, Derek From e.antero.tammi at gmail.com Mon Jun 27 13:44:21 2011 From: e.antero.tammi at gmail.com (eat) Date: Mon, 27 Jun 2011 20:44:21 +0300 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe wrote: > First I'd like to thank everyone for all the feedback you're providing, > clearly this is an important topic to many people, and the discussion has > helped clarify the ideas for me. I've renamed and updated the NEP, then > placed it into the master NumPy repository so it has a more permanent home > here: > > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > > In the NEP, I've tried to address everything that was raised in the > original thread and in Nathaniel's followup 'Concepts' thread. To deal with > the issue of whether a mask is True or False for a missing value, I've > removed the 'mask' attribute entirely, except for ufunc-like functions > np.ismissing and np.isavail which return the two styles of masks. Here's a > high level summary of how I'm thinking of the topic, and what I will > implement: > > *Missing Data Abstraction* > > There appear to be two useful ways to think about missing data that are > worth supporting. > > 1) Unknown yet existing data > 2) Data that doesn't exist > > In 1), an NA value causes outputs to become NA except in a small number of > exceptions such as boolean logic, and in 2), operations treat the data as if > there were a smaller array without the NA values. > > *Temporarily Ignoring Data* > * > * > In some cases, it is useful to flag data as NA temporarily, possibly in > several different ways, for particular calculations or testing out different > ways of throwing away outliers. This is independent of the missing data > abstraction, still requiring a choice of 1) or 2) above. > > *Implementation Techniques* > * > * > There are two mechanisms generally used to implement missing data > abstractions, > * > * > 1) An NA bit pattern > 2) A mask > > I've described a design in the NEP which can include both techniques using > the same interface. The mask approach is strictly more general than the NA > bit pattern approach, except for a few things like the idea of supporting > the dtype 'NA[f8,InfNan]' which you can read about in the NEP. > > My intention is to implement the mask-based design, and possibly also > implement the NA bit pattern design, but if anything gets cut it will be the > NA bit patterns. > > Thanks again for all your input so far, and thanks in advance for your > suggestions for improving this new revision of the NEP. > A very impressive PEP indeed. However, how would corner cases, like >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) >>> np.mean(a, skipna=True) >>> np.mean(a) be handled? My concern here is that there always seems to be such corner cases which can only be handled with specific context knowledge. Thus producing 100% generic code to handle 'missing data' is not doable. Thanks, - eat > > -Mark > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
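With plain nan standing in for NA, the two readings of this corner case are indistinguishable (the empty reduction also triggers the RuntimeWarning quoted in the reply below):

>>> import numpy as np
>>> a = np.array([np.nan, np.nan])     # nan as today's stand-in for NA
>>> a[~np.isnan(a)].mean()             # skipna-style: reduce what is left (nothing)
nan
>>> a.mean()                           # propagating style
nan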
URL: From mwwiebe at gmail.com Mon Jun 27 13:53:51 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 27 Jun 2011 12:53:51 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 12:44 PM, eat wrote: > Hi, > > On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe wrote: > >> First I'd like to thank everyone for all the feedback you're providing, >> clearly this is an important topic to many people, and the discussion has >> helped clarify the ideas for me. I've renamed and updated the NEP, then >> placed it into the master NumPy repository so it has a more permanent home >> here: >> >> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst >> >> In the NEP, I've tried to address everything that was raised in the >> original thread and in Nathaniel's followup 'Concepts' thread. To deal with >> the issue of whether a mask is True or False for a missing value, I've >> removed the 'mask' attribute entirely, except for ufunc-like functions >> np.ismissing and np.isavail which return the two styles of masks. Here's a >> high level summary of how I'm thinking of the topic, and what I will >> implement: >> >> *Missing Data Abstraction* >> >> There appear to be two useful ways to think about missing data that are >> worth supporting. >> >> 1) Unknown yet existing data >> 2) Data that doesn't exist >> >> In 1), an NA value causes outputs to become NA except in a small number of >> exceptions such as boolean logic, and in 2), operations treat the data as if >> there were a smaller array without the NA values. >> >> *Temporarily Ignoring Data* >> * >> * >> In some cases, it is useful to flag data as NA temporarily, possibly in >> several different ways, for particular calculations or testing out different >> ways of throwing away outliers. This is independent of the missing data >> abstraction, still requiring a choice of 1) or 2) above. >> >> *Implementation Techniques* >> * >> * >> There are two mechanisms generally used to implement missing data >> abstractions, >> * >> * >> 1) An NA bit pattern >> 2) A mask >> >> I've described a design in the NEP which can include both techniques using >> the same interface. The mask approach is strictly more general than the NA >> bit pattern approach, except for a few things like the idea of supporting >> the dtype 'NA[f8,InfNan]' which you can read about in the NEP. >> >> My intention is to implement the mask-based design, and possibly also >> implement the NA bit pattern design, but if anything gets cut it will be the >> NA bit patterns. >> >> Thanks again for all your input so far, and thanks in advance for your >> suggestions for improving this new revision of the NEP. >> > A very impressive PEP indeed. > > However, how would corner cases, like > > >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) > >>> np.mean(a, skipna=True) > > This should be equivalent to removing all the NA values, then calling mean, like this: >>> b = np.array([], dtype='f8') >>> np.mean(b) /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: RuntimeWarning: invalid value encountered in double_scalars return mean(axis, dtype, out) nan >>> np.mean(a) > > This would return NA, since NA values are sitting in positions that would affect the output result. > be handled? > > My concern here is that there always seems to be such corner cases which > can only be handled with specific context knowledge. 
Thus producing 100% > generic code to handle 'missing data' is not doable. > Working out the corner cases for the functions that are already in numpy seems tractable to me, how to or whether to support missing data is something the author of each new function will have to consider when missing data support is in NumPy, but I don't think we can do more than provide the mechanisms for people to use. -Mark > Thanks, > - eat > >> >> -Mark >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From e.antero.tammi at gmail.com Mon Jun 27 14:24:31 2011 From: e.antero.tammi at gmail.com (eat) Date: Mon, 27 Jun 2011 21:24:31 +0300 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe wrote: > On Mon, Jun 27, 2011 at 12:44 PM, eat wrote: > >> Hi, >> >> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe wrote: >> >>> First I'd like to thank everyone for all the feedback you're providing, >>> clearly this is an important topic to many people, and the discussion has >>> helped clarify the ideas for me. I've renamed and updated the NEP, then >>> placed it into the master NumPy repository so it has a more permanent home >>> here: >>> >>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst >>> >>> In the NEP, I've tried to address everything that was raised in the >>> original thread and in Nathaniel's followup 'Concepts' thread. To deal with >>> the issue of whether a mask is True or False for a missing value, I've >>> removed the 'mask' attribute entirely, except for ufunc-like functions >>> np.ismissing and np.isavail which return the two styles of masks. Here's a >>> high level summary of how I'm thinking of the topic, and what I will >>> implement: >>> >>> *Missing Data Abstraction* >>> >>> There appear to be two useful ways to think about missing data that are >>> worth supporting. >>> >>> 1) Unknown yet existing data >>> 2) Data that doesn't exist >>> >>> In 1), an NA value causes outputs to become NA except in a small number >>> of exceptions such as boolean logic, and in 2), operations treat the data as >>> if there were a smaller array without the NA values. >>> >>> *Temporarily Ignoring Data* >>> * >>> * >>> In some cases, it is useful to flag data as NA temporarily, possibly in >>> several different ways, for particular calculations or testing out different >>> ways of throwing away outliers. This is independent of the missing data >>> abstraction, still requiring a choice of 1) or 2) above. >>> >>> *Implementation Techniques* >>> * >>> * >>> There are two mechanisms generally used to implement missing data >>> abstractions, >>> * >>> * >>> 1) An NA bit pattern >>> 2) A mask >>> >>> I've described a design in the NEP which can include both techniques >>> using the same interface. The mask approach is strictly more general than >>> the NA bit pattern approach, except for a few things like the idea of >>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP. 
>>> >>> My intention is to implement the mask-based design, and possibly also >>> implement the NA bit pattern design, but if anything gets cut it will be the >>> NA bit patterns. >>> >>> Thanks again for all your input so far, and thanks in advance for your >>> suggestions for improving this new revision of the NEP. >>> >> A very impressive PEP indeed. >> > Hi, > >> However, how would corner cases, like >> >> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) >> >>> np.mean(a, skipna=True) >> >> This should be equivalent to removing all the NA values, then calling > mean, like this: > > >>> b = np.array([], dtype='f8') > >>> np.mean(b) > /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: > RuntimeWarning: invalid value encountered in double_scalars > return mean(axis, dtype, out) > nan > > >>> np.mean(a) >> >> This would return NA, since NA values are sitting in positions that would > affect the output result. > OK. > > >> be handled? >> >> My concern here is that there always seems to be such corner cases which >> can only be handled with specific context knowledge. Thus producing 100% >> generic code to handle 'missing data' is not doable. >> > > Working out the corner cases for the functions that are already in numpy > seems tractable to me, how to or whether to support missing data is > something the author of each new function will have to consider when missing > data support is in NumPy, but I don't think we can do more than provide the > mechanisms for people to use. > Sure. I'll ride up with this and wait when I'll have some tangible to outperform the 'traditional' NaN handling. - eat > > -Mark > > >> Thanks, >> - eat >> >>> >>> -Mark >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Mon Jun 27 15:38:59 2011 From: ben.root at ou.edu (Benjamin Root) Date: Mon, 27 Jun 2011 14:38:59 -0500 Subject: [Numpy-discussion] histogram2d error with empty inputs Message-ID: I found another empty input edge case. Somewhat recently, we fixed an issue with np.histogram() and empty inputs (so long as the bins are somehow known). >>> np.histogram([], bins=4) (array([0, 0, 0, 0]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) However, histogram2d needs the same treatment. >>> np.histogram([], [], bins=4) (array([ 0., 0.]), array([ 0. , 0.25, 0.5 , 0.75, 1. ]), array([ 0. , 0.25, 0.5 , 0.75, 1. ])) The first element in the return tuple needs to be 4x4 (in this case). Thanks, Ben Root -------------- next part -------------- An HTML attachment was scrubbed... 
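Until np.histogram2d itself handles this, a small wrapper can mirror the np.histogram behaviour shown above; the name histogram2d_empty_ok and the assumption that bins is a single integer are only for illustration:

import numpy as np

def histogram2d_empty_ok(x, y, bins=4):
    # fall back to an all-zero 2-D histogram for empty inputs, mirroring
    # what np.histogram already does for an empty 1-D input
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size == 0 and y.size == 0:
        edges = np.linspace(0.0, 1.0, bins + 1)
        return np.zeros((bins, bins)), edges, edges.copy()
    return np.histogram2d(x, y, bins=bins)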
URL: From josef.pktd at gmail.com Mon Jun 27 15:59:34 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 27 Jun 2011 15:59:34 -0400 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 2:24 PM, eat wrote: > > > On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe wrote: >> >> On Mon, Jun 27, 2011 at 12:44 PM, eat wrote: >>> >>> Hi, >>> >>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe wrote: >>>> >>>> First I'd like to thank everyone for all the feedback you're providing, >>>> clearly this is an important topic to many people, and the discussion has >>>> helped clarify the ideas for me. I've renamed and updated the NEP, then >>>> placed it into the master NumPy repository so it has a more permanent home >>>> here: >>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst >>>> In the NEP, I've tried to address everything that was raised in the >>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal with >>>> the issue of whether a mask is True or False for a missing value, I've >>>> removed the 'mask' attribute entirely, except for ufunc-like functions >>>> np.ismissing and np.isavail which return the two styles of masks. Here's a >>>> high level summary of how I'm thinking of the topic, and what I will >>>> implement: >>>> Missing Data Abstraction >>>> There appear to be two useful ways to think about missing data that are >>>> worth supporting. >>>> 1) Unknown yet existing data >>>> 2) Data that doesn't exist >>>> In 1), an NA value causes outputs to become NA except in a small number >>>> of exceptions such as boolean logic, and in 2), operations treat the data as >>>> if there were a smaller array without the NA values. >>>> Temporarily Ignoring Data >>>> In some cases, it is useful to flag data as NA temporarily, possibly in >>>> several different ways, for particular calculations or testing out different >>>> ways of throwing away outliers. This is independent of the missing data >>>> abstraction, still requiring a choice of 1) or 2) above. >>>> Implementation Techniques >>>> There are two mechanisms generally used to implement missing data >>>> abstractions, >>>> 1) An NA bit pattern >>>> 2) A mask >>>> I've described a design in the NEP which can include both techniques >>>> using the same interface. The mask approach is strictly more general than >>>> the NA bit pattern approach, except for a few things like the idea of >>>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP. >>>> My intention is to implement the mask-based design, and possibly also >>>> implement the NA bit pattern design, but if anything gets cut it will be the >>>> NA bit patterns. >>>> Thanks again for all your input so far, and thanks in advance for your >>>> suggestions for improving this new revision of the NEP. >>> >>> A very impressive PEP indeed. > > Hi, >>> >>> However, how would corner cases, like >>> >>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) >>> >>> np.mean(a, skipna=True) >> >> This should be equivalent to removing all the NA values, then calling >> mean, like this: >> >>> b = np.array([], dtype='f8') >> >>> np.mean(b) >> >> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: >> RuntimeWarning: invalid value encountered in double_scalars >> ? return mean(axis, dtype, out) >> nan >>> >>> >>> np.mean(a) >> >> This would return NA, since NA values are sitting in positions that would >> affect the output result. 
> > OK. >> >> >>> >>> be handled? >>> My?concern?here is that there?always?seems to be such corner cases which >>> can only be handled with specific context knowledge. Thus producing 100% >>> generic code to handle 'missing data' is not doable. >> >> Working out the corner cases for the functions that are already in numpy >> seems tractable to me, how to or whether to support missing data is >> something the author of each new function will have to consider when missing >> data support is in NumPy, but I don't think we can do more than provide the >> mechanisms for people to use. > > Sure. I'll ride up with this and wait when I'll have some tangible to > outperform the 'traditional' NaN handling. > - eat Just a question how things would work with the new model. How can you implement the "use" keyword from R's cov (or cor), with minimal data copying I think the basic masked array version would (or does) just assign 0 to the missing values calculate the covariance or correlation and then correct with the correct count. ------------ cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")) cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")) cov2cor(V) Arguments x a numeric vector, matrix or data frame. y NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient). na.rm logical. Should missing values be removed? use an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs". ------------ especially I'm interested in the complete.obs (drop any rows that contains a NA) case Josef >> >> -Mark >> >>> >>> Thanks, >>> - eat >>>> >>>> -Mark >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion at scipy.org >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>> >>> >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From chaoyuejoy at gmail.com Mon Jun 27 16:30:35 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Mon, 27 Jun 2011 22:30:35 +0200 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt In-Reply-To: References: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> Message-ID: Hi Derek! I tried with the lastest version of python(x,y) package with numpy version of 1.6.0. I gave the data to you with reduced columns (10 column) and rows. b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,usecols=tuple(range(10)),dtype=['S10'] + [ float for n in range(9)]) works. if you change usecols=tuple(range(10)) to usecols=range(10), it still works. b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=None) works. but b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(9)]) didn't work. 
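One adaptation of the converter idea quoted further down, for the reduced 10-column file (untested here; columns 1-9 are assumed to be the numeric ones):

import numpy as np

# map the -999 missing-value markers to nan while reading
conv = dict((i, lambda s: np.nan if float(s) == -999.0 else float(s))
            for i in range(1, 10))
b = np.genfromtxt('99Burn2003all_new.csv', delimiter=';', names=True,
                  dtype=None, converters=conv)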
I use Python(x,y)-2.6.6.1 with numpy version as 1.6.0, I use windows 32-bit system. Please don't spend too much time on this if it's not a potential problem. the final thing is, when I try to do this (I want to try the missing_values in numpy 1.6.0), it gives error: In [33]: import StringIO as StringIO In [34]: data = "1, 2, 3\n4, 5, 6" In [35]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) D:\data\LaThuile_ancillary\Jim_Randerson_data\ in () TypeError: 'module' object is not callable I think it must be some problem of my own python configuration? Much thanks again, cheers, Chao 2011/6/27 Derek Homeier > Hi Chao, > > this seems to have become quite a number of different issues! > But let's make sure I understand what's going on... > > > Thanks very much for your quick reply. I make a short summary of what > I've tried. Actually the ['S10'] + [ float for n in range(48) ] only works > when you explicitly specify the columns to be read, and genfromtxt cannot > automatically determine the type if you don't specify the type.... > > > > > In [164]: > b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=tuple(range(49)),dtype=['S10'] > + [ float for n in range(48)]) > ... > > But if I use the following, it gives error: > > > > In [171]: > b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,dtype=['S > > 10'] + [ float for n in range(48)]) > > > --------------------------------------------------------------------------- > > ValueError Traceback (most recent call > last) > > > And the above (without the usecols) did work if you explicitly typed > dtype=('S10', float, float....)? That by itself would be quite weird, > because the two should be completely equivalent. > What happens if you cast the generated list to a tuple - > dtype=tuple(['S10'] + [ float for n in range(48)])? > If you are using a recent numpy version (1.6.0 or 1.6.1rc1), could you > please file a bug report with complete machine info etc.? But I suspect this > might be an older version, you should also be able to simply use > 'usecols=range(49)' (without the tuple()). Either way, I cannot reproduce > this behaviour with the current numpy version. > > > If I don't specify the dtype, it will not recognize the type of the first > column (it displays as nan): > > > > In [172]: > b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=(0,1,2)) > > > > In [173]: b > > Out[173]: > > array([(nan, -999.0, -1.028), (nan, -999.0, -0.40899999999999997), > > (nan, -999.0, 0.16700000000000001), ..., (nan, -999.0, -999.0), > > (nan, -999.0, -999.0), (nan, -999.0, -999.0)], > > dtype=[('TIMESTAMP', ' ' > ]) > > > You _do_ have to specify 'dtype=None', since the default is 'dtype=float', > as I have remarked in my previous mail. If this does not work, it could be a > matter of the numpy version gain - there were a number of type conversion > issues fixed between 1.5.1 and 1.6.0. > > > > Then the final question is, actually the '-999.0' in the data is missing > value, but I cannot display it as 'nan' by specifying the missing_values as > '-999.0': > > but either I set the missing_values as -999.0 or using a dictionary, it > neither work... > ... 
> > > > Even this doesn't work (suppose 2 is our missing_value), > > In [184]: data = "1, 2, 3\n4, 5, 6" > > > > In [185]: np.genfromtxt(StringIO(data), > delimiter=",",dtype="int,int,int",missin > > g_values=2) > > Out[185]: > > array([(1, 2, 3), (4, 5, 6)], > > dtype=[('f0', ' > OK, same behaviour here - I found the only tests involving 'valid numbers' > as missing_values use masked arrays; for regular ndarrays they seem to be > ignored. I don't know if this is by design - the question is, what do you > need to do with the data if you know ' -999' always means a missing value? > You could certainly manipulate them after reading in... > If you have to convert them already on reading in, and using np.mafromtxt > is not an option, your best bet may be to define a custom converter like > (note you have to include any blanks, if present) > > conv = dict(((n, lambda s: s==' -999' and np.nan or float(s)) for n in > range(1,49))) > > Cheers, > Derek > > -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 99Burn2003all_new.csv Type: text/csv Size: 27535 bytes Desc: not available URL: From mwwiebe at gmail.com Mon Jun 27 17:01:17 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 27 Jun 2011 16:01:17 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 2:59 PM, wrote: > On Mon, Jun 27, 2011 at 2:24 PM, eat wrote: > > > > > > On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe wrote: > >> > >> On Mon, Jun 27, 2011 at 12:44 PM, eat wrote: > >>> > >>> Hi, > >>> > >>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe wrote: > >>>> > >>>> First I'd like to thank everyone for all the feedback you're > providing, > >>>> clearly this is an important topic to many people, and the discussion > has > >>>> helped clarify the ideas for me. I've renamed and updated the NEP, > then > >>>> placed it into the master NumPy repository so it has a more permanent > home > >>>> here: > >>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > >>>> In the NEP, I've tried to address everything that was raised in the > >>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal > with > >>>> the issue of whether a mask is True or False for a missing value, I've > >>>> removed the 'mask' attribute entirely, except for ufunc-like functions > >>>> np.ismissing and np.isavail which return the two styles of masks. > Here's a > >>>> high level summary of how I'm thinking of the topic, and what I will > >>>> implement: > >>>> Missing Data Abstraction > >>>> There appear to be two useful ways to think about missing data that > are > >>>> worth supporting. > >>>> 1) Unknown yet existing data > >>>> 2) Data that doesn't exist > >>>> In 1), an NA value causes outputs to become NA except in a small > number > >>>> of exceptions such as boolean logic, and in 2), operations treat the > data as > >>>> if there were a smaller array without the NA values. 
> >>>> Temporarily Ignoring Data > >>>> In some cases, it is useful to flag data as NA temporarily, possibly > in > >>>> several different ways, for particular calculations or testing out > different > >>>> ways of throwing away outliers. This is independent of the missing > data > >>>> abstraction, still requiring a choice of 1) or 2) above. > >>>> Implementation Techniques > >>>> There are two mechanisms generally used to implement missing data > >>>> abstractions, > >>>> 1) An NA bit pattern > >>>> 2) A mask > >>>> I've described a design in the NEP which can include both techniques > >>>> using the same interface. The mask approach is strictly more general > than > >>>> the NA bit pattern approach, except for a few things like the idea of > >>>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the > NEP. > >>>> My intention is to implement the mask-based design, and possibly also > >>>> implement the NA bit pattern design, but if anything gets cut it will > be the > >>>> NA bit patterns. > >>>> Thanks again for all your input so far, and thanks in advance for your > >>>> suggestions for improving this new revision of the NEP. > >>> > >>> A very impressive PEP indeed. > > > > Hi, > >>> > >>> However, how would corner cases, like > >>> > >>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) > >>> >>> np.mean(a, skipna=True) > >> > >> This should be equivalent to removing all the NA values, then calling > >> mean, like this: > >> >>> b = np.array([], dtype='f8') > >> >>> np.mean(b) > >> > >> > /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: > >> RuntimeWarning: invalid value encountered in double_scalars > >> return mean(axis, dtype, out) > >> nan > >>> > >>> >>> np.mean(a) > >> > >> This would return NA, since NA values are sitting in positions that > would > >> affect the output result. > > > > OK. > >> > >> > >>> > >>> be handled? > >>> My concern here is that there always seems to be such corner cases > which > >>> can only be handled with specific context knowledge. Thus producing > 100% > >>> generic code to handle 'missing data' is not doable. > >> > >> Working out the corner cases for the functions that are already in numpy > >> seems tractable to me, how to or whether to support missing data is > >> something the author of each new function will have to consider when > missing > >> data support is in NumPy, but I don't think we can do more than provide > the > >> mechanisms for people to use. > > > > Sure. I'll ride up with this and wait when I'll have some tangible to > > outperform the 'traditional' NaN handling. > > - eat > > Just a question how things would work with the new model. > How can you implement the "use" keyword from R's cov (or cor), with > minimal data copying > > I think the basic masked array version would (or does) just assign 0 > to the missing values calculate the covariance or correlation and then > correct with the correct count. > > ------------ > cov(x, y = NULL, use = "everything", > method = c("pearson", "kendall", "spearman")) > > cor(x, y = NULL, use = "everything", > method = c("pearson", "kendall", "spearman")) > > cov2cor(V) > > Arguments > x a numeric vector, matrix or data frame. > y NULL (default) or a vector, matrix or data frame with compatible > dimensions to x. The default is equivalent to y = x (but more > efficient). > na.rm logical. Should missing values be removed? 
> > use an optional character string giving a method for computing > covariances in the presence of missing values. This must be (an > abbreviation of) one of the strings "everything", "all.obs", > "complete.obs", "na.or.complete", or "pairwise.complete.obs". > ------------ > > especially I'm interested in the complete.obs (drop any rows that > contains a NA) case > I think this is mainly a matter of extending NumPy's equivalent cov function with a parameter like this. Implemented in C, I'm sure it could be done with minimal copying, I'm not exactly sure how it will have to look implemented in Python. Perhaps someone could try it once I have a basic prototype ready for testing. -Mark > > Josef > > >> > >> -Mark > >> > >>> > >>> Thanks, > >>> - eat > >>>> > >>>> -Mark > >>>> _______________________________________________ > >>>> NumPy-Discussion mailing list > >>>> NumPy-Discussion at scipy.org > >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >>>> > >>> > >>> > >>> _______________________________________________ > >>> NumPy-Discussion mailing list > >>> NumPy-Discussion at scipy.org > >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >>> > >> > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at scipy.org > >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > >> > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Mon Jun 27 17:03:25 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 27 Jun 2011 16:03:25 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett wrote: > Hi, > > On Mon, Jun 27, 2011 at 5:53 PM, Charles R Harris > wrote: > > > > > > On Mon, Jun 27, 2011 at 9:55 AM, Mark Wiebe wrote: > >> > >> First I'd like to thank everyone for all the feedback you're providing, > >> clearly this is an important topic to many people, and the discussion > has > >> helped clarify the ideas for me. I've renamed and updated the NEP, then > >> placed it into the master NumPy repository so it has a more permanent > home > >> here: > >> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > >> In the NEP, I've tried to address everything that was raised in the > >> original thread and in Nathaniel's followup 'Concepts' thread. To deal > with > >> the issue of whether a mask is True or False for a missing value, I've > >> removed the 'mask' attribute entirely, except for ufunc-like functions > >> np.ismissing and np.isavail which return the two styles of masks. Here's > a > >> high level summary of how I'm thinking of the topic, and what I will > >> implement: > >> Missing Data Abstraction > >> There appear to be two useful ways to think about missing data that are > >> worth supporting. > >> 1) Unknown yet existing data > >> 2) Data that doesn't exist > >> In 1), an NA value causes outputs to become NA except in a small number > of > >> exceptions such as boolean logic, and in 2), operations treat the data > as if > >> there were a smaller array without the NA values. 
> >> Temporarily Ignoring Data > >> In some cases, it is useful to flag data as NA temporarily, possibly in > >> several different ways, for particular calculations or testing out > different > >> ways of throwing away outliers. This is independent of the missing data > >> abstraction, still requiring a choice of 1) or 2) above. > >> Implementation Techniques > >> There are two mechanisms generally used to implement missing data > >> abstractions, > >> 1) An NA bit pattern > >> 2) A mask > >> I've described a design in the NEP which can include both techniques > using > >> the same interface. The mask approach is strictly more general than the > NA > >> bit pattern approach, except for a few things like the idea of > supporting > >> the dtype 'NA[f8,InfNan]' which you can read about in the NEP. > >> My intention is to implement the mask-based design, and possibly also > >> implement the NA bit pattern design, but if anything gets cut it will be > the > >> NA bit patterns. > > > > I have the impression that the mask-based design would be easier. Perhaps > > you could do that one first and folks could try out the API and see how > they > > like it and discover whether the memory overhead is a problem in > practice. > > That seems like a risky strategy to me, as the most likely outcome is > that people worried about memory will avoid masked arrays because they > know they use more memory. The memory usage is predictable and we > won't learn any more about it from use. We most of us already know if > we're having to optimize code for memory. > > You won't get complaints, you'll just lose a group of users, who will, > I suspect, stick to NaNs, unsatisfactory as they are. > This blade cuts both ways, we'd lose a group of users if we don't support masking semantics, too. That said, Travis favors doing both, so there's a good chance there will be time for it. -Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Mon Jun 27 18:03:44 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 27 Jun 2011 18:03:44 -0400 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 5:01 PM, Mark Wiebe wrote: > On Mon, Jun 27, 2011 at 2:59 PM, wrote: >> >> On Mon, Jun 27, 2011 at 2:24 PM, eat wrote: >> > >> > >> > On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe wrote: >> >> >> >> On Mon, Jun 27, 2011 at 12:44 PM, eat wrote: >> >>> >> >>> Hi, >> >>> >> >>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe wrote: >> >>>> >> >>>> First I'd like to thank everyone for all the feedback you're >> >>>> providing, >> >>>> clearly this is an important topic to many people, and the discussion >> >>>> has >> >>>> helped clarify the ideas for me. I've renamed and updated the NEP, >> >>>> then >> >>>> placed it into the master NumPy repository so it has a more permanent >> >>>> home >> >>>> here: >> >>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst >> >>>> In the NEP, I've tried to address everything that was raised in the >> >>>> original thread and in Nathaniel's followup 'Concepts' thread. 
To >> >>>> deal with >> >>>> the issue of whether a mask is True or False for a missing value, >> >>>> I've >> >>>> removed the 'mask' attribute entirely, except for ufunc-like >> >>>> functions >> >>>> np.ismissing and np.isavail which return the two styles of masks. >> >>>> Here's a >> >>>> high level summary of how I'm thinking of the topic, and what I will >> >>>> implement: >> >>>> Missing Data Abstraction >> >>>> There appear to be two useful ways to think about missing data that >> >>>> are >> >>>> worth supporting. >> >>>> 1) Unknown yet existing data >> >>>> 2) Data that doesn't exist >> >>>> In 1), an NA value causes outputs to become NA except in a small >> >>>> number >> >>>> of exceptions such as boolean logic, and in 2), operations treat the >> >>>> data as >> >>>> if there were a smaller array without the NA values. >> >>>> Temporarily Ignoring Data >> >>>> In some cases, it is useful to flag data as NA temporarily, possibly >> >>>> in >> >>>> several different ways, for particular calculations or testing out >> >>>> different >> >>>> ways of throwing away outliers. This is independent of the missing >> >>>> data >> >>>> abstraction, still requiring a choice of 1) or 2) above. >> >>>> Implementation Techniques >> >>>> There are two mechanisms generally used to implement missing data >> >>>> abstractions, >> >>>> 1) An NA bit pattern >> >>>> 2) A mask >> >>>> I've described a design in the NEP which can include both techniques >> >>>> using the same interface. The mask approach is strictly more general >> >>>> than >> >>>> the NA bit pattern approach, except for a few things like the idea of >> >>>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the >> >>>> NEP. >> >>>> My intention is to implement the mask-based design, and possibly also >> >>>> implement the NA bit pattern design, but if anything gets cut it will >> >>>> be the >> >>>> NA bit patterns. >> >>>> Thanks again for all your input so far, and thanks in advance for >> >>>> your >> >>>> suggestions for improving this new revision of the NEP. >> >>> >> >>> A very impressive PEP indeed. >> > >> > Hi, >> >>> >> >>> However, how would corner cases, like >> >>> >> >>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True) >> >>> >>> np.mean(a, skipna=True) >> >> >> >> This should be equivalent to removing all the NA values, then calling >> >> mean, like this: >> >> >>> b = np.array([], dtype='f8') >> >> >>> np.mean(b) >> >> >> >> >> >> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374: >> >> RuntimeWarning: invalid value encountered in double_scalars >> >> ? return mean(axis, dtype, out) >> >> nan >> >>> >> >>> >>> np.mean(a) >> >> >> >> This would return NA, since NA values are sitting in positions that >> >> would >> >> affect the output result. >> > >> > OK. >> >> >> >> >> >>> >> >>> be handled? >> >>> My?concern?here is that there?always?seems to be such corner cases >> >>> which >> >>> can only be handled with specific context knowledge. Thus producing >> >>> 100% >> >>> generic code to handle 'missing data' is not doable. >> >> >> >> Working out the corner cases for the functions that are already in >> >> numpy >> >> seems tractable to me, how to or whether to support missing data is >> >> something the author of each new function will have to consider when >> >> missing >> >> data support is in NumPy, but I don't think we can do more than provide >> >> the >> >> mechanisms for people to use. >> > >> > Sure. 
I'll ride up with this and wait when I'll have some tangible to >> > outperform the 'traditional' NaN handling. >> > - eat >> >> Just a question how things would work with the new model. >> How can you implement the "use" keyword from R's cov (or cor), with >> minimal data copying >> >> I think the basic masked array version would (or does) just assign 0 >> to the missing values calculate the covariance or correlation and then >> correct with the correct count. >> >> ------------ >> cov(x, y = NULL, use = "everything", >> ? ?method = c("pearson", "kendall", "spearman")) >> >> cor(x, y = NULL, use = "everything", >> ? ? method = c("pearson", "kendall", "spearman")) >> >> cov2cor(V) >> >> Arguments >> x ? a numeric vector, matrix or data frame. >> ?y ?NULL (default) or a vector, matrix or data frame with compatible >> dimensions to x. The default is equivalent to y = x (but more >> efficient). >> ?na.rm ? logical. Should missing values be removed? >> >> ?use ? an optional character string giving a method for computing >> covariances in the presence of missing values. This must be (an >> abbreviation of) one of the strings "everything", "all.obs", >> "complete.obs", "na.or.complete", or "pairwise.complete.obs". >> ------------ >> >> especially I'm interested in the complete.obs (drop any rows that >> contains a NA) case > > I think this is mainly a matter of extending NumPy's equivalent cov function > with a parameter like this. Implemented in C, I'm sure it could be done with > minimal copying, I'm not exactly sure how it will have to look implemented > in Python. Perhaps someone could try it once I have a basic prototype ready > for testing. This is just a typical example, going to C doesn't help, whoever is rewriting scipy.stats.mstats or is writing similar statistical code will need to do this all the time. Josef > -Mark > >> >> Josef >> >> >> >> >> -Mark >> >> >> >>> >> >>> Thanks, >> >>> - eat >> >>>> >> >>>> -Mark >> >>>> _______________________________________________ >> >>>> NumPy-Discussion mailing list >> >>>> NumPy-Discussion at scipy.org >> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >>>> >> >>> >> >>> >> >>> _______________________________________________ >> >>> NumPy-Discussion mailing list >> >>> NumPy-Discussion at scipy.org >> >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >>> >> >> >> >> >> >> _______________________________________________ >> >> NumPy-Discussion mailing list >> >> NumPy-Discussion at scipy.org >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >> > >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> > >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From pgmdevlist at gmail.com Mon Jun 27 18:24:03 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Tue, 28 Jun 2011 00:24:03 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <7939796C-8EA2-4236-8FA3-88EB6B7B5E19@gmail.com> On Jun 27, 2011, at 9:59 PM, josef.pktd at gmail.com wrote: > > Just a question how things would work with the new model. 
> How can you implement the "use" keyword from R's cov (or cor), with > minimal data copying > > I think the basic masked array version would (or does) just assign 0 > to the missing values calculate the covariance or correlation and then > correct with the correct count. Basically, yes. Basic operations have a generic internal fill value (0 for sum/subtraction, 1 for multiplication/division), then you just have to correct by the count. > > especially I'm interested in the complete.obs (drop any rows that > contains a NA) case In numpy.ma, there are functions to drop rows/columns that contain a masked value (they are in numpy.ma.extras, if I recall correctly): just filter your data by these functions before parsing it to np.cov. That's the kind of trivial example that is probably not worth overloading a function with optional parameters for. From wardefar at iro.umontreal.ca Mon Jun 27 19:14:53 2011 From: wardefar at iro.umontreal.ca (David Warde-Farley) Date: Mon, 27 Jun 2011 19:14:53 -0400 Subject: [Numpy-discussion] [ANN] Theano 0.4.0 released Message-ID: <2BE955B9-2B1E-40AE-B8BD-AF4AFB8CD6D2@iro.umontreal.ca> ========================== Announcing Theano 0.4.0 =========================== This is a major release, with lots of new features, bug fixes, and some interface changes (deprecated or potentially misleading features were removed). The upgrade is recommended for everybody, unless you rely on deprecated features that have been removed. For those using the bleeding edge version in the mercurial repository, we encourage you to update to the `0.4.0` tag. Deleting old cache ------------------ The caching mechanism for compiled C modules has been updated. In some cases, using previously-compiled modules with the new version of Theano can lead to high memory usage and code slow-down. If you experience these symptoms, we encourage you to clear your cache. The easiest way to do that is to execute: theano-cache clear (The theano-cache executable is in Theano/bin.) What's New ---------- [Include the content of NEWS.txt here] Change in output memory storage for Ops: If you implemented custom Ops, with either C or Python implementation, this will concern you. The contract for memory storage of Ops has been changed. In particular, it is no longer guaranteed that output memory buffers are either empty, or allocated by a previous execution of the same Op. Right now, here is the situation: * For Python implementation (perform), what is inside output_storage may have been allocated from outside the perform() function, for instance by another node (e.g., Scan) or the Mode. If that was the case, the memory can be assumed to be C-contiguous (for the moment). * For C implementations (c_code), nothing has changed yet. In a future version, the content of the output storage, both for Python and C versions, will either be NULL, or have the following guarantees: * It will be a Python object of the appropriate Type (for a Tensor variable, a numpy.ndarray, for a GPU variable, a CudaNdarray, for instance) * It will have the correct number of dimensions, and correct dtype However, its shape and memory layout (strides) will not be guaranteed. When that change is made, the config flag DebugMode.check_preallocated_output will help you find implementations that are not up-to-date. 
Deprecation: * tag.shape attribute deprecated (#633) * CudaNdarray_new_null is deprecated in favour of CudaNdarray_New * Dividing integers with / is deprecated: use // for integer division, or cast one of the integers to a float type if you want a float result (you may also change this behavior with config.int_division). * Removed (already deprecated) sandbox/compile module * Removed (already deprecated) incsubtensor and setsubtensor functions, inc_subtensor and set_subtensor are to be used instead. Bugs fixed: * In CudaNdarray.__{iadd,idiv}__, when it is not implemented, return the error. * THEANO_FLAGS='optimizer=None' now works as expected * Fixed memory leak in error handling on GPU-to-host copy * Fix relating specifically to Python 2.7 on Mac OS X * infer_shape can now handle Python longs * Trying to compute x % y with one or more arguments being complex now raises an error. * The output of random samples computed with uniform(..., dtype=...) is guaranteed to be of the specified dtype instead of potentially being of a higher-precision dtype. * The perform() method of DownsampleFactorMax did not give the right result when reusing output storage. This happen only if you use the Theano flags 'linker=c|py_nogc' or manually specify the mode to be 'c|py_nogc'. Crash fixed: * Work around a bug in gcc 4.3.0 that make the compilation of 2d convolution crash. * Some optimizations crashed when the "ShapeOpt" optimization was disabled. Optimization: * Optimize all subtensor followed by subtensor. GPU: * Move to the gpu fused elemwise that have other dtype then float32 in them (except float64) if the input and output are float32. * This allow to move elemwise comparisons to the GPU if we cast it to float32 after that. * Implemented CudaNdarray.ndim to have the same interface in ndarray. * Fixed slowdown caused by multiple chained views on CudaNdarray objects * CudaNdarray_alloc_contiguous changed so as to never try to free memory on a view: new "base" property * Safer decref behaviour in CudaNdarray in case of failed allocations * New GPU implementation of tensor.basic.outer * Multinomial random variates now available on GPU New features: * ProfileMode * profile the scan overhead * simple hook system to add profiler * reordered the output to be in the order of more general to more specific * DebugMode now checks Ops with different patterns of preallocated memory, configured by config.DebugMode.check_preallocated_output. * var[vector of index] now work, (grad work recursively, the direct grad work inplace, gpu work) * limitation: work only of the outer most dimensions. * New way to test the graph as we build it. Allow to easily find the source of shape mismatch error: `http://deeplearning.net/software/theano/tutorial/debug_faq.html#interactive-debugger`__ * cuda.root inferred if nvcc is on the path, otherwise defaults to /usr/local/cuda * Better graph printing for graphs involving a scan subgraph * Casting behavior can be controlled through config.cast_policy, new (experimental) mode. * Smarter C module cache, avoiding erroneous usage of the wrong C implementation when some options change, and avoiding recompiling the same module multiple times in some situations. * The "theano-cache clear" command now clears the cache more thoroughly. * More extensive linear algebra ops (CPU only) that wrap scipy.linalg now available in the sandbox. * CUDA devices 4 - 16 should now be available if present. 
* infer_shape support for the View op, better infer_shape support in Scan * infer_shape supported in all case of subtensor * tensor.grad now gives an error by default when computing the gradient wrt a node that is disconnected from the cost (not in the graph, or no continuous path from that op to the cost). * New tensor.isnan and isinf functions. Documentation: * Better commenting of cuda_ndarray.cu * Fixes in the scan documentation: add missing declarations/print statements * Better error message on failed __getitem__ * Updated documentation on profile mode * Better documentation of testing on Windows * Better documentation of the 'run_individual_tests' script Unit tests: * More strict float comparaison by default * Reuse test for subtensor of tensor for gpu tensor(more gpu test) * Tests that check for aliased function inputs and assure appropriate copying (#374) * Better test of copies in CudaNdarray * New tests relating to the new base pointer requirements * Better scripts to run tests individually or in batches * Some tests are now run whenever cuda is available and not just when it has been enabled before * Tests display less pointless warnings. Other: * Correctly put the broadcast flag to True in the output var of a Reshape op when we receive an int 1 in the new shape. * pydotprint: high contrast mode is now the default, option to print more compact node names. * pydotprint: How trunk label that are too long. * More compact printing (ignore leading "Composite" in op names) Download -------- You can download Theano from http://pypi.python.org/pypi/Theano. Description ----------- Theano is a Python library that allows you to define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays. It is built on top of NumPy. Theano features: * tight integration with NumPy: a similar interface to NumPy's. numpy.ndarrays are also used internally in Theano-compiled functions. * transparent use of a GPU: perform data-intensive computations up to 140x faster than on a CPU (support for float32 only). * efficient symbolic differentiation: Theano can compute derivatives for functions of one or many inputs. * speed and stability optimizations: avoid nasty bugs when computing expressions such as log(1+ exp(x)) for large values of x. * dynamic C code generation: evaluate expressions faster. * extensive unit-testing and self-verification: includes tools for detecting and diagnosing bugs and/or potential problems. Theano has been powering large-scale computationally intensive scientific research since 2007, but it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal). Resources --------- About Theano: http://deeplearning.net/software/theano/ About NumPy: http://numpy.scipy.org/ About SciPy: http://www.scipy.org/ Machine Learning Tutorial with Theano on Deep Architectures: http://deeplearning.net/tutorial/ Acknowledgments --------------- I would like to thank all contributors of Theano. For this particular release, many people have helped during the release sprint: (in alphabetical order) Frederic Bastien, James Bergstra, Nicolas Boulanger-Lewandowski, Raul Chandias Ferrari, Olivier Delalleau, Guillaume Desjardins, Philippe Hamel, Pascal Lamblin, Razvan Pascanu and David Warde-Farley. Also, thank you to all NumPy and SciPy developers as Theano builds on its strength. 
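To make the Description above concrete, a minimal hedged usage sketch (the variable names are illustrative): it builds the log(1 + exp(x)) expression mentioned under the stability optimizations and compiles it with theano.function:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')              # symbolic float64 vector
    y = T.log(1 + T.exp(x))         # expression the stability optimizations target
    f = theano.function([x], y)     # compiled callable, C code where possible

    print(f(np.array([0.0, 1.0, 30.0])))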
All questions/comments are always welcome on the Theano mailing-lists ( http://deeplearning.net/software/theano/ ) From kwgoodman at gmail.com Mon Jun 27 20:07:18 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Mon, 27 Jun 2011 17:07:18 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 8:55 AM, Mark Wiebe wrote: > First I'd like to thank everyone for all the feedback you're providing, > clearly this is an important topic to many people, and the discussion has > helped clarify the ideas for me. I've renamed and updated the NEP, then > placed it into the master NumPy repository so it has a more permanent home > here: > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > In the NEP, I've tried to address everything that was raised in the original > thread and in Nathaniel's followup 'Concepts' thread. To deal with the issue > of whether a mask is True or False for a missing value, I've removed the > 'mask' attribute entirely, except for ufunc-like functions np.ismissing and > np.isavail which return the two styles of masks. Here's a high level summary > of how I'm thinking of the topic, and what I will implement: > Missing Data Abstraction > There appear to be two useful ways to think about missing data that are > worth supporting. > 1) Unknown yet existing data > 2) Data that doesn't exist > In 1), an NA value causes outputs to become NA except in a small number of > exceptions such as boolean logic, and in 2), operations treat the data as if > there were a smaller array without the NA values. > Temporarily Ignoring Data > In some cases, it is useful to flag data as NA temporarily, possibly in > several different ways, for particular calculations or testing out different > ways of throwing away outliers. This is independent of the missing data > abstraction, still requiring a choice of 1) or 2) above. > Implementation Techniques > There are two mechanisms generally used to implement missing data > abstractions, > 1) An NA bit pattern > 2) A mask > I've described a design in the NEP which can include both techniques using > the same interface. The mask approach is strictly more general than the NA > bit pattern approach, except for a few things like the idea of supporting > the dtype 'NA[f8,InfNan]' which you can read about in the NEP. > My intention is to implement the mask-based design, and possibly also > implement the NA bit pattern design, but if anything gets cut it will be the > NA bit patterns. > Thanks again for all your input so far, and thanks in advance for your > suggestions for improving this new revision of the NEP. I'm trying to understand this part of the missing data NEP: "While numpy.NA works to mask values, it does not itself have a dtype. This means that returning the numpy.NA singleton from an operation like 'arr[0]' would be throwing away the dtype, which is still valuable to retain, so 'arr[0]' will return a zero-dimensional array either with its value masked, or containing the NA bit pattern for the array's dtype." If I do something like this in Cython: cdef np.float64_t ai for i in range(n): ai = a[i] ... Then I need to specify the type of ai, say float64 as above. What happens when a[i] is np.NA? Is ai still a float64? If NA is a bit pattern taken from float64 then a[i] could be float64, but if it is a 0d array then it would not be float64 and I assume I would run into problems or have to cast. 
So what does all this mean for iterating over each element of an array in Cython or C? Would I need to check the mask of element i first and only assign to ai if the mask is True (meaning not missing)? From derek at astro.physik.uni-goettingen.de Mon Jun 27 20:14:00 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Tue, 28 Jun 2011 02:14:00 +0200 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt In-Reply-To: References: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> Message-ID: <0951AD3C-752B-433E-93D9-B7C80745293A@astro.physik.uni-goettingen.de> Hi Chao, by mistake did not reply to the list last time... On 27.06.2011, at 10:30PM, Chao YUE wrote: Hi Derek! > > I tried with the lastest version of python(x,y) package with numpy version of 1.6.0. I gave the data to you with reduced columns (10 column) and rows. > > b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,usecols=tuple(range(10)),dtype=['S10'] + [ float for n in range(9)]) works. > if you change usecols=tuple(range(10)) to usecols=range(10), it still works. > > b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=None) works. > > but b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(9)]) didn't work. > > I use Python(x,y)-2.6.6.1 with numpy version as 1.6.0, I use windows 32-bit system. > > Please don't spend too much time on this if it's not a potential problem. > OK, dtype=None works on 1.6.0, that's the important bit. From your example file it seems the dtype list does work not without specifying usecols, because your header contains and excess semicolon in the field "Air temperature (High; HMP45C)", thus genfromtxt expects more data columns than actually exist. If you replace the semicolon you should be set (or, if I may suggest, write another header line with catchier field names so you don't have to work with array fields like "b['Water vapor density by LiCor 7500']" ;-). Otherwise both options work for me with python2.6+numpy-1.5.1 as well as 1.6.0/1.6.1rc1. I am curious though why your python interpreter gave this error message: > ValueError Traceback (most recent call last) > > D:\data\LaThuile_ancillary\Jim_Randerson_data\ in () > > C:\Python26\lib\site-packages\numpy\lib\npyio.pyc in genfromtxt(fname, dtype, co > mments, delimiter, skiprows, skip_header, skip_footer, converters, missing, miss > ing_values, filling_values, usecols, names, excludelist, deletechars, replace_sp > ace, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_rais > e) > 1449 # Raise an exception ? > > 1450 if invalid_raise: > -> 1451 raise ValueError(errmsg) > 1452 # Issue a warning ? > > 1453 else: > > ValueError since ipython2.6 on my Mac reported this: ... 1450 if invalid_raise: -> 1451 raise ValueError(errmsg) 1452 # Issue a warning ? 1453 else: ValueError: Some errors were detected ! Line #3 (got 10 columns instead of 11) Line #4 (got 10 columns instead of 11) etc.... which of course provided the right lead to the problem - was the actual errmsg really missing, or did you cut the message too soon? 
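A small synthetic illustration of the pitfall described above; the file contents are made up, but the idea is the same: a delimiter hiding inside a header field makes names=True produce one name too many, so genfromtxt reports errors like "got 10 columns instead of 11". Sanitizing the header line before parsing is one workaround:

    import numpy as np
    from StringIO import StringIO   # Python 2, as used elsewhere in this thread

    # Hypothetical file: the second header field contains the delimiter ';',
    # so the header splits into four names while the data rows have three columns.
    raw = ("TIMESTAMP;Air temperature (High; HMP45C);Net_radiation\n"
           "2003-01-01;1.5;-999.0\n"
           "2003-01-02;1.7;3.2\n")

    # Workaround: strip the stray delimiter from the header before parsing.
    lines = raw.splitlines()
    lines[0] = lines[0].replace('(High; ', '(High, ')
    data = np.genfromtxt(StringIO("\n".join(lines)), delimiter=';',
                         names=True, dtype=['S10', float, float])
    print(data['Net_radiation'])    # [-999.    3.2]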
> the final thing is, when I try to do this (I want to try the missing_values in numpy 1.6.0), it gives error: > > In [33]: import StringIO as StringIO > > In [34]: data = "1, 2, 3\n4, 5, 6" > > In [35]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2) > --------------------------------------------------------------------------- > TypeError Traceback (most recent call last) > > D:\data\LaThuile_ancillary\Jim_Randerson_data\ in () > > TypeError: 'module' object is not callable > You want to use "from StringIO import StringIO" (or write "StringIO.StringIO(data)". But again, this will not work the way you expect it to with int/float numbers set as missing_values, and reading to regular arrays. I've tested this on 1.6.1 and the current development branch as well, and the missing_values are only considered for masked arrays. This is not likely to change soon, and may actually be intentional, so to process those numbers on read-in, your best option would be to define a custom set of "converters=conv" as shown in my last mail. Cheers, Derek > 2011/6/27 Derek Homeier > Hi Chao, > > this seems to have become quite a number of different issues! > But let's make sure I understand what's going on... > > > Thanks very much for your quick reply. I make a short summary of what I've tried. Actually the ['S10'] + [ float for n in range(48) ] only works when you explicitly specify the columns to be read, and genfromtxt cannot automatically determine the type if you don't specify the type.... > > > > > In [164]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=tuple(range(49)),dtype=['S10'] + [ float for n in range(48)]) > ... > > But if I use the following, it gives error: > > > > In [171]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,dtype=['S > > 10'] + [ float for n in range(48)]) > > --------------------------------------------------------------------------- > > ValueError Traceback (most recent call last) > > > And the above (without the usecols) did work if you explicitly typed dtype=('S10', float, float....)? That by itself would be quite weird, because the two should be completely equivalent. > What happens if you cast the generated list to a tuple - dtype=tuple(['S10'] + [ float for n in range(48)])? > If you are using a recent numpy version (1.6.0 or 1.6.1rc1), could you please file a bug report with complete machine info etc.? But I suspect this might be an older version, you should also be able to simply use 'usecols=range(49)' (without the tuple()). Either way, I cannot reproduce this behaviour with the current numpy version. > > > If I don't specify the dtype, it will not recognize the type of the first column (it displays as nan): > > > > In [172]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=(0,1,2)) > > > > In [173]: b > > Out[173]: > > array([(nan, -999.0, -1.028), (nan, -999.0, -0.40899999999999997), > > (nan, -999.0, 0.16700000000000001), ..., (nan, -999.0, -999.0), > > (nan, -999.0, -999.0), (nan, -999.0, -999.0)], > > dtype=[('TIMESTAMP', ' > ]) > > > You _do_ have to specify 'dtype=None', since the default is 'dtype=float', as I have remarked in my previous mail. If this does not work, it could be a matter of the numpy version gain - there were a number of type conversion issues fixed between 1.5.1 and 1.6.0. 
> > > > Then the final question is, actually the '-999.0' in the data is missing value, but I cannot display it as 'nan' by specifying the missing_values as '-999.0': > > but either I set the missing_values as -999.0 or using a dictionary, it neither work... > ... > > > > Even this doesn't work (suppose 2 is our missing_value), > > In [184]: data = "1, 2, 3\n4, 5, 6" > > > > In [185]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missin > > g_values=2) > > Out[185]: > > array([(1, 2, 3), (4, 5, 6)], > > dtype=[('f0', ' > OK, same behaviour here - I found the only tests involving 'valid numbers' as missing_values use masked arrays; for regular ndarrays they seem to be ignored. I don't know if this is by design - the question is, what do you need to do with the data if you know ' -999' always means a missing value? You could certainly manipulate them after reading in... > If you have to convert them already on reading in, and using np.mafromtxt is not an option, your best bet may be to define a custom converter like (note you have to include any blanks, if present) > > conv = dict(((n, lambda s: s==' -999' and np.nan or float(s)) for n in range(1,49))) > > Cheers, > Derek > > > > > -- > *********************************************************************************** > Chao YUE > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > UMR 1572 CEA-CNRS-UVSQ > Batiment 712 - Pe 119 > 91191 GIF Sur YVETTE Cedex > Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 > ************************************************************************************ > > <99Burn2003all_new.csv> From mwwiebe at gmail.com Mon Jun 27 22:04:30 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Mon, 27 Jun 2011 21:04:30 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 7:07 PM, Keith Goodman wrote: > On Mon, Jun 27, 2011 at 8:55 AM, Mark Wiebe wrote: > > First I'd like to thank everyone for all the feedback you're providing, > > clearly this is an important topic to many people, and the discussion has > > helped clarify the ideas for me. I've renamed and updated the NEP, then > > placed it into the master NumPy repository so it has a more permanent > home > > here: > > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > > In the NEP, I've tried to address everything that was raised in the > original > > thread and in Nathaniel's followup 'Concepts' thread. To deal with the > issue > > of whether a mask is True or False for a missing value, I've removed the > > 'mask' attribute entirely, except for ufunc-like functions np.ismissing > and > > np.isavail which return the two styles of masks. Here's a high level > summary > > of how I'm thinking of the topic, and what I will implement: > > Missing Data Abstraction > > There appear to be two useful ways to think about missing data that are > > worth supporting. > > 1) Unknown yet existing data > > 2) Data that doesn't exist > > In 1), an NA value causes outputs to become NA except in a small number > of > > exceptions such as boolean logic, and in 2), operations treat the data as > if > > there were a smaller array without the NA values. > > Temporarily Ignoring Data > > In some cases, it is useful to flag data as NA temporarily, possibly in > > several different ways, for particular calculations or testing out > different > > ways of throwing away outliers. 
This is independent of the missing data > > abstraction, still requiring a choice of 1) or 2) above. > > Implementation Techniques > > There are two mechanisms generally used to implement missing data > > abstractions, > > 1) An NA bit pattern > > 2) A mask > > I've described a design in the NEP which can include both techniques > using > > the same interface. The mask approach is strictly more general than the > NA > > bit pattern approach, except for a few things like the idea of supporting > > the dtype 'NA[f8,InfNan]' which you can read about in the NEP. > > My intention is to implement the mask-based design, and possibly also > > implement the NA bit pattern design, but if anything gets cut it will be > the > > NA bit patterns. > > Thanks again for all your input so far, and thanks in advance for your > > suggestions for improving this new revision of the NEP. > > I'm trying to understand this part of the missing data NEP: > > "While numpy.NA works to mask values, it does not itself have a dtype. > This means that returning the numpy.NA singleton from an operation > like 'arr[0]' would be throwing away the dtype, which is still > valuable to retain, so 'arr[0]' will return a zero-dimensional array > either with its value masked, or containing the NA bit pattern for the > array's dtype." > > If I do something like this in Cython: > > cdef np.float64_t ai > for i in range(n): > ai = a[i] > ... > > Then I need to specify the type of ai, say float64 as above. > > What happens when a[i] is np.NA? Is ai still a float64? If NA is a bit > pattern taken from float64 then a[i] could be float64, but if it is a > 0d array then it would not be float64 and I assume I would run into > problems or have to cast. > > So what does all this mean for iterating over each element of an array > in Cython or C? Would I need to check the mask of element i first and > only assign to ai if the mask is True (meaning not missing)? > I'll have to add mention of Cython in the NEP. What should happen in Cython is the same thing that happens in Python, that the abstractions described in the NEP are followed precisely. Until the ability to work with missing values is added to Cython, the above will not be possible. The type np.float64_t isn't correct, Cython will need to add its own versions of np.nafloat64_t which it translates to/from by calling the appropriate NumPy APIs. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nadavh at visionsense.com Tue Jun 28 00:38:29 2011 From: nadavh at visionsense.com (Nadav Horesh) Date: Mon, 27 Jun 2011 21:38:29 -0700 Subject: [Numpy-discussion] Recommndations for an easy GUI Message-ID: <26FC23E7C398A64083C980D16001012D246B98519A@VA3DIAXVS361.RED001.local> I have an application which generates and displays RGB images as rate of several frames/seconds (5-15). Currently I use Tkinter+PIL, but I have a problem that it slows down the rate significantly. I am looking for a fast and easy alternative. Platform: Linux I prefer tools that would work also with python3 I looked at the following: 1. matplotlib's imshow: Easy but too slow. 2. pyqwt: Easy, fast, does not support rgb images (only grayscale) 3. pygame: Nice but lacking widgets like buttons etc. 4. PyQT: Can do, I would like something simpler (I'll rather focus on computations, not GUI) Nadav. 
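The replies that follow suggest keeping the object returned by imshow and updating it with set_data() instead of recreating the plot each frame; a minimal hedged sketch of that pattern, with a random array standing in for the camera frames:

    import numpy as np
    import matplotlib.pyplot as plt

    plt.ion()                                  # interactive drawing, no blocking show()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    frame = np.random.random((480, 640, 3))    # placeholder RGB frame
    im = ax.imshow(frame, interpolation='nearest')

    for _ in range(200):                       # stand-in for the acquisition loop
        frame = np.random.random((480, 640, 3))
        im.set_data(frame)                     # reuse the existing image object
        fig.canvas.draw()                      # the only per-frame drawing cost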
-------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Tue Jun 28 01:32:44 2011 From: ben.root at ou.edu (Benjamin Root) Date: Tue, 28 Jun 2011 00:32:44 -0500 Subject: [Numpy-discussion] Recommndations for an easy GUI In-Reply-To: <26FC23E7C398A64083C980D16001012D246B98519A@VA3DIAXVS361.RED001.local> References: <26FC23E7C398A64083C980D16001012D246B98519A@VA3DIAXVS361.RED001.local> Message-ID: On Monday, June 27, 2011, Nadav Horesh wrote: > > > > > > > > I have an application which generates and displays RGB images as rate of several frames/seconds (5-15). Currently I use Tkinter+PIL, but I have a problem that it slows down the rate significantly. > I am looking for a fast and easy alternative. > > Platform: Linux > I prefer tools that would work also with python3 > > I looked at the following: > 1. matplotlib's imshow: Easy but too slow. > 2. pyqwt: Easy, fast, does not support rgb images (only grayscale) > 3. pygame: Nice but lacking widgets like buttons etc. > 4. PyQT: Can do, I would like something simpler (I'll rather focus on computations, not GUI) > > ?? Nadav. > > > > You can get imshow to be fast. If you save the object returned fron the call to imshow, there is a method called set_data() (I believe), which you can pass values to after the initial call. I have gotten very good framerates on my EeePC doing that. Most of the computation in imshow is with respect to the creation of the object, not the display. Ben Root From Nicolas.Rougier at inria.fr Tue Jun 28 02:46:40 2011 From: Nicolas.Rougier at inria.fr (Nicolas Rougier) Date: Tue, 28 Jun 2011 08:46:40 +0200 Subject: [Numpy-discussion] Recommndations for an easy GUI In-Reply-To: <26FC23E7C398A64083C980D16001012D246B98519A@VA3DIAXVS361.RED001.local> References: <26FC23E7C398A64083C980D16001012D246B98519A@VA3DIAXVS361.RED001.local> Message-ID: <710674A8-E245-4D40-933C-5400A6A03B4D@inria.fr> Have a look at glumpy: http://code.google.com/p/glumpy/ It's quite simple and very fast for images (it's based on OpenGL/shaders). Nicolas On Jun 28, 2011, at 6:38 AM, Nadav Horesh wrote: > I have an application which generates and displays RGB images as rate of several frames/seconds (5-15). Currently I use Tkinter+PIL, but I have a problem that it slows down the rate significantly. I am looking for a fast and easy alternative. > > Platform: Linux > I prefer tools that would work also with python3 > > I looked at the following: > 1. matplotlib's imshow: Easy but too slow. > 2. pyqwt: Easy, fast, does not support rgb images (only grayscale) > 3. pygame: Nice but lacking widgets like buttons etc. > 4. PyQT: Can do, I would like something simpler (I'll rather focus on computations, not GUI) > > Nadav. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From chaoyuejoy at gmail.com Tue Jun 28 03:00:57 2011 From: chaoyuejoy at gmail.com (Chao YUE) Date: Tue, 28 Jun 2011 09:00:57 +0200 Subject: [Numpy-discussion] data type specification when using numpy.genfromtxt In-Reply-To: <0951AD3C-752B-433E-93D9-B7C80745293A@astro.physik.uni-goettingen.de> References: <326FD82A-98C4-4E8A-8E73-295F9690C2AD@astro.physik.uni-goettingen.de> <0951AD3C-752B-433E-93D9-B7C80745293A@astro.physik.uni-goettingen.de> Message-ID: Thanks very much!! you are right. It's becuase the extra semicolon in the head row. 
I have no problems anymore. I thank you for your time. cheeers, Chao 2011/6/28 Derek Homeier > Hi Chao, > > by mistake did not reply to the list last time... > > On 27.06.2011, at 10:30PM, Chao YUE wrote: > Hi Derek! > > > > I tried with the lastest version of python(x,y) package with numpy > version of 1.6.0. I gave the data to you with reduced columns (10 column) > and rows. > > > > > b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,usecols=tuple(range(10)),dtype=['S10'] > + [ float for n in range(9)]) works. > > if you change usecols=tuple(range(10)) to usecols=range(10), it still > works. > > > > > b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=None) > works. > > > > but > b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=['S10'] > + [ float for n in range(9)]) didn't work. > > > > I use Python(x,y)-2.6.6.1 with numpy version as 1.6.0, I use windows > 32-bit system. > > > > Please don't spend too much time on this if it's not a potential problem. > > > OK, dtype=None works on 1.6.0, that's the important bit. > >From your example file it seems the dtype list does work not without > specifying usecols, because your header contains and excess semicolon in the > field "Air temperature (High; HMP45C)", thus genfromtxt expects more data > columns than actually exist. If you replace the semicolon you should be set > (or, if I may suggest, write another header line with catchier field names > so you don't have to work with array fields like "b['Water vapor density by > LiCor 7500']" ;-). > Otherwise both options work for me with python2.6+numpy-1.5.1 as well as > 1.6.0/1.6.1rc1. > > I am curious though why your python interpreter gave this error message: > > ValueError Traceback (most recent call > last) > > > > D:\data\LaThuile_ancillary\Jim_Randerson_data\ in > () > > > > C:\Python26\lib\site-packages\numpy\lib\npyio.pyc in genfromtxt(fname, > dtype, co > > mments, delimiter, skiprows, skip_header, skip_footer, converters, > missing, miss > > ing_values, filling_values, usecols, names, excludelist, deletechars, > replace_sp > > ace, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, > invalid_rais > > e) > > 1449 # Raise an exception ? > > > > 1450 if invalid_raise: > > -> 1451 raise ValueError(errmsg) > > 1452 # Issue a warning ? > > > > 1453 else: > > > > ValueError > > since ipython2.6 on my Mac reported this: > ... > 1450 if invalid_raise: > -> 1451 raise ValueError(errmsg) > 1452 # Issue a warning ? > > 1453 else: > > ValueError: Some errors were detected ! > Line #3 (got 10 columns instead of 11) > Line #4 (got 10 columns instead of 11) > etc.... > which of course provided the right lead to the problem - was the actual > errmsg really missing, or did you cut the message too soon? > > > the final thing is, when I try to do this (I want to try the > missing_values in numpy 1.6.0), it gives error: > > > > In [33]: import StringIO as StringIO > > > > In [34]: data = "1, 2, 3\n4, 5, 6" > > > > In [35]: np.genfromtxt(StringIO(data), > delimiter=",",dtype="int,int,int",missing_values=2) > > > --------------------------------------------------------------------------- > > TypeError Traceback (most recent call > last) > > > > D:\data\LaThuile_ancillary\Jim_Randerson_data\ in > () > > > > TypeError: 'module' object is not callable > > > You want to use "from StringIO import StringIO" (or write > "StringIO.StringIO(data)". 
> But again, this will not work the way you expect it to with int/float > numbers set as missing_values, and reading to regular arrays. I've tested > this on 1.6.1 and the current development branch as well, and the > missing_values are only considered for masked arrays. This is not likely to > change soon, and may actually be intentional, so to process those numbers on > read-in, your best option would be to define a custom set of > "converters=conv" as shown in my last mail. > > Cheers, > Derek > > > 2011/6/27 Derek Homeier > > Hi Chao, > > > > this seems to have become quite a number of different issues! > > But let's make sure I understand what's going on... > > > > > Thanks very much for your quick reply. I make a short summary of what > I've tried. Actually the ['S10'] + [ float for n in range(48) ] only works > when you explicitly specify the columns to be read, and genfromtxt cannot > automatically determine the type if you don't specify the type.... > > > > > > > > In [164]: > b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=tuple(range(49)),dtype=['S10'] > + [ float for n in range(48)]) > > ... > > > But if I use the following, it gives error: > > > > > > In [171]: > b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,dtype=['S > > > 10'] + [ float for n in range(48)]) > > > > --------------------------------------------------------------------------- > > > ValueError Traceback (most recent call > last) > > > > > And the above (without the usecols) did work if you explicitly typed > dtype=('S10', float, float....)? That by itself would be quite weird, > because the two should be completely equivalent. > > What happens if you cast the generated list to a tuple - > dtype=tuple(['S10'] + [ float for n in range(48)])? > > If you are using a recent numpy version (1.6.0 or 1.6.1rc1), could you > please file a bug report with complete machine info etc.? But I suspect this > might be an older version, you should also be able to simply use > 'usecols=range(49)' (without the tuple()). Either way, I cannot reproduce > this behaviour with the current numpy version. > > > > > If I don't specify the dtype, it will not recognize the type of the > first column (it displays as nan): > > > > > > In [172]: > b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=(0,1,2)) > > > > > > In [173]: b > > > Out[173]: > > > array([(nan, -999.0, -1.028), (nan, -999.0, -0.40899999999999997), > > > (nan, -999.0, 0.16700000000000001), ..., (nan, -999.0, -999.0), > > > (nan, -999.0, -999.0), (nan, -999.0, -999.0)], > > > dtype=[('TIMESTAMP', ' ('Net_radiation', ' > > ]) > > > > > You _do_ have to specify 'dtype=None', since the default is > 'dtype=float', as I have remarked in my previous mail. If this does not > work, it could be a matter of the numpy version gain - there were a number > of type conversion issues fixed between 1.5.1 and 1.6.0. > > > > > > Then the final question is, actually the '-999.0' in the data is > missing value, but I cannot display it as 'nan' by specifying the > missing_values as '-999.0': > > > but either I set the missing_values as -999.0 or using a dictionary, it > neither work... > > ... 
> > > > > > Even this doesn't work (suppose 2 is our missing_value), > > > In [184]: data = "1, 2, 3\n4, 5, 6" > > > > > > In [185]: np.genfromtxt(StringIO(data), > delimiter=",",dtype="int,int,int",missin > > > g_values=2) > > > Out[185]: > > > array([(1, 2, 3), (4, 5, 6)], > > > dtype=[('f0', ' > > > OK, same behaviour here - I found the only tests involving 'valid > numbers' as missing_values use masked arrays; for regular ndarrays they seem > to be ignored. I don't know if this is by design - the question is, what do > you need to do with the data if you know ' -999' always means a missing > value? You could certainly manipulate them after reading in... > > If you have to convert them already on reading in, and using np.mafromtxt > is not an option, your best bet may be to define a custom converter like > (note you have to include any blanks, if present) > > > > conv = dict(((n, lambda s: s==' -999' and np.nan or float(s)) for n in > range(1,49))) > > > > Cheers, > > Derek > > > > > > > > > > -- > > > *********************************************************************************** > > Chao YUE > > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > > UMR 1572 CEA-CNRS-UVSQ > > Batiment 712 - Pe 119 > > 91191 GIF Sur YVETTE Cedex > > Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 > > > ************************************************************************************ > > > > <99Burn2003all_new.csv> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16 ************************************************************************************ -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Tue Jun 28 07:25:07 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 28 Jun 2011 12:25:07 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Mon, Jun 27, 2011 at 10:03 PM, Mark Wiebe wrote: > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett ... >> That seems like a risky strategy to me, as the most likely outcome is >> that people worried about memory will avoid masked arrays because they >> know they use more memory. ?The memory usage is predictable and we >> won't learn any more about it from use. ?We most of us already know if >> we're having to optimize code for memory. >> >> You won't get complaints, you'll just lose a group of users, who will, >> I suspect, stick to NaNs, unsatisfactory as they are. > > This blade cuts both ways, we'd lose a group of users if we don't support > masking semantics, too. I didn't mean to agitate for my own use-case, I was only saying that 'implement masking, wait and see for na-dtype' was not a good strategy for deciding. > That said, Travis favors doing both, so there's a good chance there will be > time for it. That's very encouraging. 
I hope I can contribute somehow when the time comes, with review or some other way, Thanks a lot, Matthew From xscript at gmx.net Tue Jun 28 08:34:07 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Tue, 28 Jun 2011 14:34:07 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: (Mark Wiebe's message of "Sat, 25 Jun 2011 14:56:44 -0500") References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: <87vcvqt9s0.fsf@ginnungagap.bsc.es> Mark Wiebe writes: > The design that's forming is a combination of: > * Solve the missing data problem? > * My ideas of what a good solution looks like: > ? ?* applies to all NumPy dtypes in a fully general way > ? ?* high-performance, low overhead where possible > ? ?* makes the C-level implementation of NumPy nicer to work with, not harder > ? ?* easy to use from Python for unskilled programmers > ? ?* easy to use more powerful functionality from Python for skilled programmers > ? ?* satisfies all or most of the needs of the many users of arrays with a "missing data" aspect to them I would add here an efficient mechanism to reinterpret exising data with different missing information (no copies of the backing array). Although I'm not sure whether this requires first-class citizenship or not. > * All the feedback I'm getting from discussions on the list [...] > I've updated a section "Parameterized Data Type With NA Signal Values" > in the NEP with an idea for now an NA bit pattern approach could > coexist and work together with the mask-based approach. I think I've > solved some of the generality and implementation obstacles, it would > be great to get some feedback on that. Some (obvious) thoughts about it: * Trivial to store, as the missing property is encoded in the value itself. * Third-party (non-Python) code needs some interface to interpret these without having to know the implementation details (although the interface is rather trivial). * Data marked as missing loses its original value. * Reinterpreting the same data (memory buffer) with different missing information requires either memory copies or separate mask arrays (see above) So, while it (data types with NA signal values) has its advantages on a simpler interaction with 3rd party code and during long-term storage, masks will still be needed. I think that deciding on the value of NA signal values boils down to this question: should 3rd party code be able to interpret missing data information stored in the separate mask array? If the answer is no, then 3rd party code should be given a copy of the data where the masked array is merged with the ndarray data buffer (assuming the original ndarray had a masked array before passing it to the 3rd party code). As by definition (?) the ndarray with a mask must retain the original data, the result of the 3rd party code must be translated back into an ndarray + mask. If the answer is yes, then I think the NA signal values just add unnecessary complexity, as the 3rd party code will already need to use some numpy-specific API to handle missing data through the ndarray buffer + mask buffer. This reminds me that if 3rd party were to use the new iterator interface, the interface could be twisted in a way that it returns only the non-missing parts. For the sake of performance, this could be optional, so that the default behaviour is to just iterate through non-missing data but an option can be used to iterate over all data, and leave missing data handling up to the 3rd party code. 
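Since the trade-off sketched above hinges on whether a mask merely hides values and on what third-party code gets to see, here is a short numpy.ma illustration of the current behaviour (the data values are arbitrary):

    import numpy as np

    data = np.arange(6, dtype=float)
    m = np.ma.masked_array(data, mask=[0, 0, 1, 0, 1, 0])  # a view, no copy of data

    print(m.mean())           # reductions skip the masked entries
    print(m.filled(np.nan))   # merged copy for code that cannot read the mask

    data[2] = 99.0            # the mask hides values, it does not destroy them
    print(m.data[2])          # 99.0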
My 2 cents, Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From xscript at gmx.net Tue Jun 28 08:42:59 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Tue, 28 Jun 2011 14:42:59 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: (Charles R. Harris's message of "Sat, 25 Jun 2011 08:25:33 -0600") References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> Message-ID: <87liwmt9d8.fsf@ginnungagap.bsc.es> Charles R Harris writes: > I think we may need some standard format for masked data on disk if we > don't go the NA value route. As I see it, the mask array is just some metadata that is attached to the dtype descriptor. I don't know how an ndarray is (un)pickled from disk, but I imagine that each dtype descriptor can be (de)serialized. Thus this will also include the mask array. Note that if the mask array is metadata attached to the dtype, structured arrays (or however they're called nowadays) can have different mask arrays for each of the struct fields, or share them arbitrarily. In any case, pickling will take care of storing just once each of the mask arrays. This reminds me that I've wanted for long time to store extra metadata on the dtypes. What I've found out is that metadata is stored on the dtype that describes the structure of the whole array, not on the per-field dtype. This makes it harder to retain per-field metadata whenever you operate on a field-per-field basis. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From nadavh at visionsense.com Tue Jun 28 09:06:20 2011 From: nadavh at visionsense.com (Nadav Horesh) Date: Tue, 28 Jun 2011 06:06:20 -0700 Subject: [Numpy-discussion] =?windows-1255?b?+vnl4eQ6ICBSZWNvbW1uZGF0aW9u?= =?windows-1255?q?s_for_an_easy_GUI?= In-Reply-To: <710674A8-E245-4D40-933C-5400A6A03B4D@inria.fr> Message-ID: <26FC23E7C398A64083C980D16001012D246BAF6BCA@VA3DIAXVS361.RED001.local> I tried Root?s advice and with the get_data method and GTK (without Agg) I got decent speed -- 30 fps (display speed, without the calculations overhead). The combination of matplotlib and glumpy resulted in 88 fps. I think I?ll have a solutionif glumpy lack of documentation will net get in the way. Thank you all for the useful advices, Nadav. ________________________________ ???: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] ??? Nicolas Rougier ????: Tuesday, June 28, 2011 09:47 ??: Discussion of Numerical Python ????: Re: [Numpy-discussion] Recommndations for an easy GUI Have a look at glumpy: http://code.google.com/p/glumpy/ It's quite simple and very fast for images (it's based on OpenGL/shaders). Nicolas On Jun 28, 2011, at 6:38 AM, Nadav Horesh wrote: I have an application which generates and displays RGB images as rate of several frames/seconds (5-15). Currently I use Tkinter+PIL, but I have a problem that it slows down the rate significantly. I am looking for a fast and easy alternative. Platform: Linux I prefer tools that would work also with python3 I looked at the following: 1. matplotlib's imshow: Easy but too slow. 2. pyqwt: Easy, fast, does not support rgb images (only grayscale) 3. 
pygame: Nice but lacking widgets like buttons etc. 4. PyQT: Can do, I would like something simpler (I'll rather focus on computations, not GUI) Nadav. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Jun 28 11:06:02 2011 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 28 Jun 2011 08:06:02 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe wrote: > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett > wrote: >> You won't get complaints, you'll just lose a group of users, who will, >> I suspect, stick to NaNs, unsatisfactory as they are. > > This blade cuts both ways, we'd lose a group of users if we don't support > masking semantics, too. The problem is, that's inevitable. One might think that trying to find a compromise solution that picks a few key aspects of each approach would be a good way to make everyone happy, but in my experience, it mostly leads to systems that are a muddled mess and that make everyone unhappy. You're much better off saying screw it, these goals are in scope and those ones aren't, and we're going to build something consistent and powerful instead of focusing on how long the feature list is. That's also the problem with focusing too much on a list of use cases: you might capture everything on any single list, but there are actually an infinite variety of use cases that will arise in the future. If you can generalize beyond the use cases to find some simple and consistent mental model, and implement that, then that'll work for all those future use cases too. But sometimes that requires deciding what *not* to implement. Just my opinion, but it's fairly hard won. Anyway, it's pretty clear that in this particular case, there are two distinct features that different people want: the missing data feature, and the masked array feature. The more I think about it, the less I see how they can be combined into one dessert topping + floor wax solution. Here are three particular points where they seem to contradict each other: Missing data: We think memory usage is critical. The ideal solution has zero overhead. If we can't get that, then at the very least we want the overhead to be 1 bit/item instead of 1 byte/item. Masked arrays: We say, it's critical to have good ways to manipulate the masking array, share it between multiple arrays, and so forth. And numpy already has great support for all those things! So obviously the masking array should be exposed as a standard ndarray. Missing data: Once you've assigned NA to a value, you should *not* be able to get at what was stored there before. Masked arrays: You must be able to unmask a value and recover what was stored there before. (You might think, what difference does it make if you *can* unmask an item? Us missing data folks could just ignore this feature. But: whatever we end up implementing is something that I will have to explain over and over to different people, most of them not particularly sophisticated programmers. And there's just no sensible way to explain this idea that if you store some particular value, then it replaces the old value, but if you store NA, then the old value is still there. 
They will get confused, and then store it away as another example of how computers are arbitrary and confusing and they're just too dumb to understand them, and I *hate* doing that to people. Plus the more that happens, the more they end up digging themselves into some hole by trying things at random, and then I have to dig them out again. So the point is, we can go either way, but in both ways there *is* a cost, and we have to decide.) Missing data: It's critical that NAs propagate through reduction operations by default, though there should also be some way to turn this off. Masked arrays: Masked values should be silently ignored by reduction operations, and having to remember to pass a special flag to turn on this behavior on every single ufunc call would be a huge pain. (Masked array advocates: please correct me if I'm misrepresenting you anywhere above!) > That said, Travis favors doing both, so there's a good chance there will be > time for it. One issue with the current draft is that I don't see any addressing of how masking-missing and bit-pattern-missing interact: a = np.zeros(10, dtype="NA[f8]") a.flags.hasmask = True a[5] = np.NA # Now what? If you're going to implement both things anyway, and you need to figure out how they interact anyway, then why not split them up into two totally separate features? Here's my proposal: 1) Add a purely dtype-based support for missing data: 1.A) Add some flags/metadata to the dtype structure to let it describe what a missing value looks like for an element of its type. Something like, an example NA value plus a function that can be called to identify NAs when they occur in arrays. (Notice that this interface is general enough to handle both the bit-stealing approach and the maybe() approach.) 1.B) Add an np.NA object, and teach the various coercion loops to use the above fields in the dtype structure to handle it. 1.C) Teach the various reduction loops that if a particular flag is set in the dtype, then they also should check for NAs and handle them appropriately. (If this flag is not set, then it means that this dtype's ufunc loops are already NA aware and the generic machinery is not needed unless skipmissing=True is given. This is useful for user-defined dtypes, and probably also a nice optimization for floats using NaN.) 1.D) Finally, as a convenience, add some standard NA-aware dtypes. Personally, I wouldn't bother with complicated string-based mini-language described in the current NEP; just define some standard NA-enabled dtype objects in the numpy namespace or provide a function that takes a dtype + a NA bit-pattern and spits out an NA-enabled dtype or whatever. 2) Add a better masked array support. 2.A) Masked arrays are simply arrays with an extra attribute '.visible', which is an arbitrary numpy array that is broadcastable to the same shape as the masked array. There's no magic here -- if you say a.visible = b.visible, then they now share a visibility array, according to the ordinary rules of Python assignment. (Well, there needs to be some check for shape compatibility, but that's not much magic.) 2.B) To minimize confusion with the missing value support, the way you mask/unmask items is through expressions like 'a.visible[10] = False'; there is no magic np.masked object. 
(There are a few options for what happens when you try to use scalar indexing explicitly to extract an invisible value -- you could return the actual value from behind the mask, or throw an error, or return a scalar masked array whose .visible attribute was a scalar array containing False. I don't know what the people who actually use this stuff would prefer :-).) 2.C) Indexing and shape-changing operations on the masked array are automatically applied to the .visible array as well. (Attempting to call .resize() on an array which is being used as the .visible attribute of some other array is an error.) 2.D) Ufuncs on masked arrays always ignore invisible items. We can probably share some code here between the handling of skipmissing=True for NA-enabled dtypes and invisible items in masked arrays, but that's purely an implementation detail. This approach to masked arrays requires that the ufunc machinery have some special knowledge of what a masked array is, so masked arrays would have to become part of the core. I'm not sure whether or not they should be part of the np.ndarray base class or remain as a subclass, though. There's an argument that they're more of a convenience feature like np.matrix, and code which interfaces between ndarray's and C becomes more complicated if it has to be prepared to handle visibility. (Note that in contrast, ndarray's can already contain arbitrary user-defined dtypes, so the missing value support proposed here doesn't add any new issues to C interfacing.) So maybe it'd be better to leave it as a core supported subclass? Could go either way. -- Nathaniel From mesanthu at gmail.com Tue Jun 28 11:56:48 2011 From: mesanthu at gmail.com (santhu kumar) Date: Tue, 28 Jun 2011 10:56:48 -0500 Subject: [Numpy-discussion] SVD does not converge Message-ID: Hello, I have a 380X5 matrix and when I am calculating pseudo-inverse of the matrix using pinv(numpy.linalg) I get the following error message: raise LinAlgError, 'SVD did not converge' numpy.linalg.linalg.LinAlgError: SVD did not converge I have looked in the list that it is a recurring issue but I was unable to find any solution. Can somebody please guide me on how to fix that issue? Thanks Santhosh -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jun 28 12:26:48 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 28 Jun 2011 10:26:48 -0600 Subject: [Numpy-discussion] SVD does not converge In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 9:56 AM, santhu kumar wrote: > Hello, > > I have a 380X5 matrix and when I am calculating pseudo-inverse of the > matrix using pinv(numpy.linalg) I get the following error message: > > raise LinAlgError, 'SVD did not converge' > numpy.linalg.linalg.LinAlgError: SVD did not converge > > I have looked in the list that it is a recurring issue but I was unable to > find any solution. Can somebody please guide me on how to fix that issue? > > This is usually a problem with ATLAS. What distro/os are you using and where did the ATLAS, if that is what you are using, come from? Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
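Charles points at ATLAS above and Sebastian asks below about NaNs in the input; as a hedged sketch (not a fix for a broken LAPACK build), one can check for non-finite entries first and, for a full-column-rank 380x5 matrix, fall back to the normal-equations form of the pseudo-inverse, so the dense factorization only runs on a 5x5 system:

    import numpy as np

    a = np.random.randn(380, 5)      # placeholder for the actual data

    # Non-finite entries are a common cause of 'SVD did not converge'.
    if not np.isfinite(a).all():
        raise ValueError("clean or mask NaN/Inf entries before calling pinv")

    try:
        a_pinv = np.linalg.pinv(a)
    except np.linalg.LinAlgError:
        # Valid only when a has full column rank; worse conditioning than SVD.
        a_pinv = np.linalg.solve(np.dot(a.T, a), a.T)

    print(a_pinv.shape)              # (5, 380)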
URL: From sebastian at sipsolutions.net Tue Jun 28 11:59:29 2011 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 28 Jun 2011 17:59:29 +0200 Subject: [Numpy-discussion] SVD does not converge In-Reply-To: References: Message-ID: <1309276769.7388.19.camel@sebastian> Hi, On Tue, 2011-06-28 at 10:56 -0500, santhu kumar wrote: > Hello, > > I have a 380X5 matrix and when I am calculating pseudo-inverse of the > matrix using pinv(numpy.linalg) I get the following error message: > > raise LinAlgError, 'SVD did not converge' > numpy.linalg.linalg.LinAlgError: SVD did not converge > Just a quick guess, but is there any chance that your array includes a nan value? Regards, Sebastian > I have looked in the list that it is a recurring issue but I was > unable to find any solution. Can somebody please guide me on how to > fix that issue? > > Thanks > Santhosh > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From charlesr.harris at gmail.com Tue Jun 28 12:38:32 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 28 Jun 2011 10:38:32 -0600 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 9:06 AM, Nathaniel Smith wrote: > On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe wrote: > > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett > > > wrote: > >> You won't get complaints, you'll just lose a group of users, who will, > >> I suspect, stick to NaNs, unsatisfactory as they are. > > > > This blade cuts both ways, we'd lose a group of users if we don't support > > masking semantics, too. > > The problem is, that's inevitable. One might think that trying to find > a compromise solution that picks a few key aspects of each approach > would be a good way to make everyone happy, but in my experience, it > mostly leads to systems that are a muddled mess and that make everyone > unhappy. You're much better off saying screw it, these goals are in > scope and those ones aren't, and we're going to build something > consistent and powerful instead of focusing on how long the feature > list is. That's also the problem with focusing too much on a list of > use cases: you might capture everything on any single list, but there > are actually an infinite variety of use cases that will arise in the > future. If you can generalize beyond the use cases to find some simple > and consistent mental model, and implement that, then that'll work for > all those future use cases too. But sometimes that requires deciding > what *not* to implement. > > Just my opinion, but it's fairly hard won. > > Anyway, it's pretty clear that in this particular case, there are two > distinct features that different people want: the missing data > feature, and the masked array feature. The more I think about it, the > less I see how they can be combined into one dessert topping + floor > wax solution. Here are three particular points where they seem to > contradict each other: > > Missing data: We think memory usage is critical. The ideal solution > has zero overhead. If we can't get that, then at the very least we > want the overhead to be 1 bit/item instead of 1 byte/item. > Masked arrays: We say, it's critical to have good ways to manipulate > the masking array, share it between multiple arrays, and so forth. And > numpy already has great support for all those things! 
So obviously the > masking array should be exposed as a standard ndarray. > > Missing data: Once you've assigned NA to a value, you should *not* be > able to get at what was stored there before. > Masked arrays: You must be able to unmask a value and recover what was > stored there before. > > (You might think, what difference does it make if you *can* unmask an > item? Us missing data folks could just ignore this feature. But: > whatever we end up implementing is something that I will have to > explain over and over to different people, most of them not > particularly sophisticated programmers. And there's just no sensible > way to explain this idea that if you store some particular value, then > it replaces the old value, but if you store NA, then the old value is > still there. They will get confused, and then store it away as another > example of how computers are arbitrary and confusing and they're just > too dumb to understand them, and I *hate* doing that to people. Plus > the more that happens, the more they end up digging themselves into > some hole by trying things at random, and then I have to dig them out > again. So the point is, we can go either way, but in both ways there > *is* a cost, and we have to decide.) > > Missing data: It's critical that NAs propagate through reduction > operations by default, though there should also be some way to turn > this off. > Masked arrays: Masked values should be silently ignored by reduction > operations, and having to remember to pass a special flag to turn on > this behavior on every single ufunc call would be a huge pain. > > (Masked array advocates: please correct me if I'm misrepresenting you > anywhere above!) > > > That said, Travis favors doing both, so there's a good chance there will > be > > time for it. > > One issue with the current draft is that I don't see any addressing of > how masking-missing and bit-pattern-missing interact: > a = np.zeros(10, dtype="NA[f8]") > a.flags.hasmask = True > a[5] = np.NA # Now what? > > If you're going to implement both things anyway, and you need to > figure out how they interact anyway, then why not split them up into > two totally separate features? > > Here's my proposal: > 1) Add a purely dtype-based support for missing data: > 1.A) Add some flags/metadata to the dtype structure to let it describe > what a missing value looks like for an element of its type. Something > like, an example NA value plus a function that can be called to > identify NAs when they occur in arrays. (Notice that this interface is > general enough to handle both the bit-stealing approach and the > maybe() approach.) > 1.B) Add an np.NA object, and teach the various coercion loops to use > the above fields in the dtype structure to handle it. > 1.C) Teach the various reduction loops that if a particular flag is > set in the dtype, then they also should check for NAs and handle them > appropriately. (If this flag is not set, then it means that this > dtype's ufunc loops are already NA aware and the generic machinery is > not needed unless skipmissing=True is given. This is useful for > user-defined dtypes, and probably also a nice optimization for floats > using NaN.) > 1.D) Finally, as a convenience, add some standard NA-aware dtypes. 
> Personally, I wouldn't bother with complicated string-based > mini-language described in the current NEP; just define some standard > NA-enabled dtype objects in the numpy namespace or provide a function > that takes a dtype + a NA bit-pattern and spits out an NA-enabled > dtype or whatever. > > 2) Add a better masked array support. > 2.A) Masked arrays are simply arrays with an extra attribute > '.visible', which is an arbitrary numpy array that is broadcastable to > the same shape as the masked array. There's no magic here -- if you > say a.visible = b.visible, then they now share a visibility array, > according to the ordinary rules of Python assignment. (Well, there > needs to be some check for shape compatibility, but that's not much > magic.) > 2.B) To minimize confusion with the missing value support, the way you > mask/unmask items is through expressions like 'a.visible[10] = False'; > there is no magic np.masked object. (There are a few options for what > happens when you try to use scalar indexing explicitly to extract an > invisible value -- you could return the actual value from behind the > mask, or throw an error, or return a scalar masked array whose > .visible attribute was a scalar array containing False. I don't know > what the people who actually use this stuff would prefer :-).) > 2.C) Indexing and shape-changing operations on the masked array are > automatically applied to the .visible array as well. (Attempting to > call .resize() on an array which is being used as the .visible > attribute of some other array is an error.) > 2.D) Ufuncs on masked arrays always ignore invisible items. We can > probably share some code here between the handling of skipmissing=True > for NA-enabled dtypes and invisible items in masked arrays, but that's > purely an implementation detail. > > This approach to masked arrays requires that the ufunc machinery have > some special knowledge of what a masked array is, so masked arrays > would have to become part of the core. I'm not sure whether or not > they should be part of the np.ndarray base class or remain as a > subclass, though. There's an argument that they're more of a > convenience feature like np.matrix, and code which interfaces between > ndarray's and C becomes more complicated if it has to be prepared to > handle visibility. (Note that in contrast, ndarray's can already > contain arbitrary user-defined dtypes, so the missing value support > proposed here doesn't add any new issues to C interfacing.) So maybe > it'd be better to leave it as a core supported subclass? Could go > either way. > > Nathaniel, an implementation using masks will look *exactly* like an implementation using na-dtypes from the user's point of view. Except that taking a masked view of an unmasked array allows ignoring values without destroying or copying the original data. The only downside I can see to an implementation using masks is memory and disk storage, and perhaps memory mapped arrays. And I rather expect the former to solve itself in a few years, eight gigs is becoming a baseline for workstations and in a couple of years I expect that to be up around 16-32, and a few years after that.... In any case we are talking 12% - 25% overhead, and in practice I expect it won't be quite as big a problem as folks project. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
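For concreteness, the semantic split being argued here can already be seen with today's tools: numpy.ma silently skips masked items and lets you unmask them later, while NaN (the closest existing thing to a bit-pattern NA for floats) propagates unless you explicitly ask nansum to skip it. A minimal illustration using only existing numpy/numpy.ma calls:

    import numpy as np

    # Masked-array semantics: the value is hidden, not destroyed.
    a = np.ma.array([1.0, 2.0, 7.0], mask=[False, True, False])
    print(a.sum())       # 8.0  -- masked item silently ignored
    print(a.data[1])     # 2.0  -- the original value is still there
    a.mask[1] = False    # unmasking recovers it
    print(a.sum())       # 10.0

    # Bit-pattern-style semantics, with NaN standing in for NA.
    b = np.array([1.0, np.nan, 7.0])
    print(b.sum())       # nan  -- the missing value propagates
    print(np.nansum(b))  # 8.0  -- skipping must be requested explicitly

Whether np.NA should behave like the first block or the second by default is exactly the choice at stake.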
URL: From lou_boog2000 at yahoo.com Tue Jun 28 12:43:19 2011 From: lou_boog2000 at yahoo.com (Lou Pecora) Date: Tue, 28 Jun 2011 09:43:19 -0700 (PDT) Subject: [Numpy-discussion] SVD does not converge In-Reply-To: References: Message-ID: <1309279399.80253.YahooMailRC@web34405.mail.mud.yahoo.com> ________________________________ From: santhu kumar To: numpy-discussion at scipy.org Sent: Tue, June 28, 2011 11:56:48 AM Subject: [Numpy-discussion] SVD does not converge Hello, I have a 380X5 matrix and when I am calculating pseudo-inverse of the matrix using pinv(numpy.linalg) I get the following error message: raise LinAlgError, 'SVD did not converge' numpy.linalg.linalg.LinAlgError: SVD did not converge I have looked in the list that it is a recurring issue but I was unable to find any solution. Can somebody please guide me on how to fix that issue? Thanks Santhosh ============================================== I had a similar problem (although I wasn't looking for the pseudo inverse). I found that "squaring" the matrix fixed the problem. But I'm guessing in your situation that would mean a 380 x 380 matrix (I hope I'm thinking about your case correctly). But it's worth trying since it's easy to do. -- Lou Pecora, my views are my own. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jh at physics.ucf.edu Tue Jun 28 12:50:52 2011 From: jh at physics.ucf.edu (Joe Harrington) Date: Tue, 28 Jun 2011 12:50:52 -0400 Subject: [Numpy-discussion] Missing/accumulating data In-Reply-To: (numpy-discussion-request@scipy.org) References: Message-ID: As with Travis, I have not had time to wade through the 150+ messages on masked arrays, but I'd like to raise a concept I've mentioned in the past that would enable a broader use if done slightly differently. That is, the "masked array" problem is a subset of this more-general problem. Please respond on this thread if you want me to see the replies. The problem is of data accumulation and averaging, and it comes up all the time in astronomy, medical imaging, and other fields that handle stacks of images. Say that I observe a planet, taking images at, say, 30 Hz for several hours, as with many video cameras. I want to register or map those images to a common coordinate system, stack them, and average them. I must recognize: 1. transient there are bad pixels due to cosmic ray hits on the CCD array 2. some pixels are permanently dead and always produce bad data 3. some data in the image is not part of the planet 4. some images have too much atmospheric blurring to be used 5. sometimes moons of planets like Jupiter pass in front 6. etc. For each original image: Make a mask where 1 is good and 0 is bad. This trims both bad pixels and pixels that are not part of the planet. Evaluate the images and zero the entire image (or regions of an image) if quality is low (or there is an obstruction). Shift (or map, in the case of lon-lat mapping) the image AND THE MASK to a common coordinate grid. Append the image and the mask to the ends of the stacks of prior images and masks. For each x and y position in the image/mask stacks: Perform some averaging of all pixels in the stack that are not masked, or otherwise use the mask info to decide how to use the image info. The most common of these averages is to multiply the stack of images by the stack of masks, zeroing the bad data, then sum both images and masks down the third axis. 
This produces an accumulator image and a count image, where the count image says how many data points went into the accumulator at each pixel position. Divide the accumulator by the count. This produces an image with no missing data and the highest SNR per pixel possible. Two things differ from the current ma package: the sense of TRUE/FALSE and the ability to have numbers other than 0 or 1. For example, if data quality were part of the assessment, each pixel could get a data quality number and the accumulation would only include pixels with better than a certain number, or there could be weighted averaging. Currently, we just implement this as described, without using the masked array capability at all. However, it's cumbersome to hand around pairs of objects, and making a new object to handle them together means breaking it apart when calling standard routines that don't handle masks. There are probably some very fast ways to do interesting kinds of averages that we haven't explored but would if this kind of capability were in the infrastructure. So, I hope that if some kind of masking goes into the array object, it allows TRUE=good, FALSE=bad, and at least an integer count if not a floating quality number. One bit is not enough. Of course, these could be options. Certainly adding 7 or 31 more bits to the mask will make space problems for some, as would flipping the sense of the mask. However, speed matters. Some of those doing this kind of processing are handling hours of staring video data, which is 1e6 frames a night from each camera. --jh-- From njs at pobox.com Tue Jun 28 13:26:36 2011 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 28 Jun 2011 10:26:36 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 9:38 AM, Charles R Harris wrote: > Nathaniel, an implementation using masks will look *exactly* like an > implementation using na-dtypes from the user's point of view. Except that > taking a masked view of an unmasked array allows ignoring values without > destroying or copying the original data. Charles, I know that :-). But if that view thing is an advertised feature -- in fact, the key selling point for the masking-based implementation, included specifically to make a significant contingent of users happy -- then it's certainly user-visible. And it will make other users unhappy, like I said. That's life. But who cares? My main point is that implementing a missing data solution and a separate masked array solution is probably less work than implementing a single everything-to-everybody solution *anyway*, *and* it might make both sets of users happier too. Notice that in my proposal, there's really nothing there that isn't already in Mark's NEP in some form or another, but in my version there's almost no overlap between the two features. That's not because I was trying to make them artificially different; it's because I tried to think of the most natural ways to satisfy each set of use cases, and they're just different. 
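A rough numpy sketch of the accumulate-and-count averaging Joe Harrington describes above, assuming the registered frames are stacked along the first axis into an array frames with a matching stack of 0/1 (or floating-point quality) weights; the names are made up for illustration:

    import numpy as np

    def stack_average(frames, weights):
        # frames:  (nframes, ny, nx) registered images
        # weights: same shape, 0 = bad/ignore, 1 = good (or a quality weight)
        accum = (frames * weights).sum(axis=0)   # weighted accumulator image
        count = weights.sum(axis=0)              # per-pixel count (or weight sum)
        out = np.zeros_like(accum)
        good = count > 0
        out[good] = accum[good] / count[good]    # leave pixels with no good data at 0
        return out, count

    frames = np.random.rand(10, 4, 4)
    weights = (np.random.rand(10, 4, 4) > 0.2).astype(float)
    avg, count = stack_average(frames, weights)

Weighted averaging falls out of the same recipe when the weights are quality numbers rather than 0/1, which is the direction the "one bit is not enough" comment points.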
-- Nathaniel From charlesr.harris at gmail.com Tue Jun 28 13:34:38 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 28 Jun 2011 11:34:38 -0600 Subject: [Numpy-discussion] Missing/accumulating data In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 10:50 AM, Joe Harrington wrote: > As with Travis, I have not had time to wade through the 150+ messages > on masked arrays, but I'd like to raise a concept I've mentioned in > the past that would enable a broader use if done slightly differently. > That is, the "masked array" problem is a subset of this more-general > problem. Please respond on this thread if you want me to see the > replies. > > The problem is of data accumulation and averaging, and it comes up all > the time in astronomy, medical imaging, and other fields that handle > stacks of images. Say that I observe a planet, taking images at, say, > 30 Hz for several hours, as with many video cameras. I want to > register or map those images to a common coordinate system, stack > them, and average them. I must recognize: > > 1. transient there are bad pixels due to cosmic ray hits on the CCD array > 2. some pixels are permanently dead and always produce bad data > 3. some data in the image is not part of the planet > 4. some images have too much atmospheric blurring to be used > 5. sometimes moons of planets like Jupiter pass in front > 6. etc. > > For each original image: > Make a mask where 1 is good and 0 is bad. This trims both bad > pixels and pixels that are not part of the planet. > Evaluate the images and zero the entire image (or regions of an > image) if quality is low (or there is an obstruction). > Shift (or map, in the case of lon-lat mapping) the image AND THE > MASK to a common coordinate grid. > Append the image and the mask to the ends of the stacks of prior > images and masks. > > For each x and y position in the image/mask stacks: > Perform some averaging of all pixels in the stack that are not masked, > or otherwise use the mask info to decide how to use the image info. > > The most common of these averages is to multiply the stack of images > by the stack of masks, zeroing the bad data, then sum both images and > masks down the third axis. This produces an accumulator image and a > count image, where the count image says how many data points went into > the accumulator at each pixel position. Divide the accumulator by the > count. This produces an image with no missing data and the highest > SNR per pixel possible. > > Two things differ from the current ma package: the sense of TRUE/FALSE > and the ability to have numbers other than 0 or 1. For example, if > data quality were part of the assessment, each pixel could get a data > quality number and the accumulation would only include pixels with > better than a certain number, or there could be weighted averaging. > > Currently, we just implement this as described, without using the > masked array capability at all. However, it's cumbersome to hand > around pairs of objects, and making a new object to handle them > together means breaking it apart when calling standard routines that > don't handle masks. There are probably some very fast ways to do > interesting kinds of averages that we haven't explored but would if > this kind of capability were in the infrastructure. > > So, I hope that if some kind of masking goes into the array object, it > allows TRUE=good, FALSE=bad, and at least an integer count if not a > floating quality number. One bit is not enough. 
Of course, these > could be options. Certainly adding 7 or 31 more bits to the mask will > make space problems for some, as would flipping the sense of the > mask. However, speed matters. Some of those doing this kind of > processing are handling hours of staring video data, which is 1e6 > frames a night from each camera. > > This might be described as masks as weights, or perhaps from the Bayesian point of view, as measures of just how unknown unknown data is. One thing I like about masks is that they allow ideas like this to creep in and thus encourage future innovations. But as you also point out, 8 bits might be a bit on the low side for that and a consistent idea of how to interpret things would help ala the missing or unknown models. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From e.antero.tammi at gmail.com Tue Jun 28 13:36:18 2011 From: e.antero.tammi at gmail.com (eat) Date: Tue, 28 Jun 2011 20:36:18 +0300 Subject: [Numpy-discussion] SVD does not converge In-Reply-To: <1309279399.80253.YahooMailRC@web34405.mail.mud.yahoo.com> References: <1309279399.80253.YahooMailRC@web34405.mail.mud.yahoo.com> Message-ID: Hi, On Tue, Jun 28, 2011 at 7:43 PM, Lou Pecora wrote: > > ------------------------------ > *From:* santhu kumar > *To:* numpy-discussion at scipy.org > *Sent:* Tue, June 28, 2011 11:56:48 AM > *Subject:* [Numpy-discussion] SVD does not converge > > Hello, > > I have a 380X5 matrix and when I am calculating pseudo-inverse of the > matrix using pinv(numpy.linalg) I get the following error message: > > raise LinAlgError, 'SVD did not converge' > numpy.linalg.linalg.LinAlgError: SVD did not converge > > I have looked in the list that it is a recurring issue but I was unable to > find any solution. Can somebody please guide me on how to fix that issue? > > Thanks > Santhosh > ============================================== > > I had a similar problem (although I wasn't looking for the pseudo inverse). > I found that "squaring" the matrix fixed the problem. But I'm guessing in > your situation that would mean a 380 x 380 matrix (I hope I'm thinking about > your case correctly). But it's worth trying since it's easy to do. > With my rusty linear algebra: if one chooses to proceed with this 'squaring' avenue, wouldn't it then be more economical to base the calculations on a square 5x5 matrix? Something like: A_pinv= dot(A, pinv(dot(A.T, A))).T Instead of a 380x380 based matrix: A_pinv= dot(pinv(dot(A, A.T)), A).T My two cents - eat > > -- Lou Pecora, my views are my own. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
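A sketch of how the suggestions in this sub-thread fit together: first check for NaNs (Sebastian's guess), then, if the 380x5 matrix has full column rank, build the pseudo-inverse from the small 5x5 Gram matrix as eat proposes. Squaring worsens the conditioning, so this is a trade-off rather than a cure, and the example matrix below is just a stand-in:

    import numpy as np

    A = np.random.rand(380, 5)   # stand-in for the real 380x5 matrix

    if np.isnan(A).any():
        raise ValueError("matrix contains NaNs; clean the data before calling pinv")

    # pinv via the 5x5 Gram matrix (assumes A has full column rank)
    A_pinv = np.dot(A, np.linalg.pinv(np.dot(A.T, A))).T

    # sanity check against the direct SVD-based pseudo-inverse
    print(np.allclose(A_pinv, np.linalg.pinv(A)))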
URL: From charlesr.harris at gmail.com Tue Jun 28 13:43:21 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 28 Jun 2011 11:43:21 -0600 Subject: [Numpy-discussion] SVD does not converge In-Reply-To: References: <1309279399.80253.YahooMailRC@web34405.mail.mud.yahoo.com> Message-ID: On Tue, Jun 28, 2011 at 11:36 AM, eat wrote: > Hi, > > On Tue, Jun 28, 2011 at 7:43 PM, Lou Pecora wrote: > >> >> ------------------------------ >> *From:* santhu kumar >> *To:* numpy-discussion at scipy.org >> *Sent:* Tue, June 28, 2011 11:56:48 AM >> *Subject:* [Numpy-discussion] SVD does not converge >> >> Hello, >> >> I have a 380X5 matrix and when I am calculating pseudo-inverse of the >> matrix using pinv(numpy.linalg) I get the following error message: >> >> raise LinAlgError, 'SVD did not converge' >> numpy.linalg.linalg.LinAlgError: SVD did not converge >> >> I have looked in the list that it is a recurring issue but I was unable to >> find any solution. Can somebody please guide me on how to fix that issue? >> >> Thanks >> Santhosh >> ============================================== >> >> I had a similar problem (although I wasn't looking for the pseudo >> inverse). I found that "squaring" the matrix fixed the problem. But I'm >> guessing in your situation that would mean a 380 x 380 matrix (I hope I'm >> thinking about your case correctly). But it's worth trying since it's easy >> to do. >> > With my rusty linear algebra: if one chooses to proceed with this > 'squaring' avenue, wouldn't it then be more economical to base the > calculations on a square 5x5 matrix? Something like: > A_pinv= dot(A, pinv(dot(A.T, A))).T > Instead of a 380x380 based matrix: > A_pinv= dot(pinv(dot(A, A.T)), A).T > > Yes, or similarly, start with a QR decomposition and work with R. In fact, the SVD algorithm we use first does QR, then reduces R to tridiagonal form (IIRC), and proceeds from there. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mesanthu at gmail.com Tue Jun 28 13:46:08 2011 From: mesanthu at gmail.com (santhu kumar) Date: Tue, 28 Jun 2011 12:46:08 -0500 Subject: [Numpy-discussion] SVD does not converge Message-ID: Hello After looking at the suggestions I have checked my matrix again and found that the matrix was wrong.(Has some NAN's and other values which are wrong). After correcting it it works fine. Sorry for the mistake and thanks for your time, Santhosh On Tue, Jun 28, 2011 at 11:43 AM, wrote: > Send NumPy-Discussion mailing list submissions to > numpy-discussion at scipy.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mail.scipy.org/mailman/listinfo/numpy-discussion > or, via email, send a message with subject or body 'help' to > numpy-discussion-request at scipy.org > > You can reach the person managing the list at > numpy-discussion-owner at scipy.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of NumPy-Discussion digest..." > > > Today's Topics: > > 1. SVD does not converge (santhu kumar) > 2. Re: SVD does not converge (Charles R Harris) > 3. Re: SVD does not converge (Sebastian Berg) > 4. Re: missing data discussion round 2 (Charles R Harris) > 5. 
Re: SVD does not converge (Lou Pecora) >
> URL: > http://mail.scipy.org/pipermail/numpy-discussion/attachments/20110628/ec281918/attachment-0001.html > > ------------------------------ > > Message: 5 > Date: Tue, 28 Jun 2011 09:43:19 -0700 (PDT) > From: Lou Pecora > Subject: Re: [Numpy-discussion] SVD does not converge > To: Discussion of Numerical Python > Message-ID: <1309279399.80253.YahooMailRC at web34405.mail.mud.yahoo.com> > Content-Type: text/plain; charset="us-ascii" > > > > > ________________________________ > From: santhu kumar > To: numpy-discussion at scipy.org > Sent: Tue, June 28, 2011 11:56:48 AM > Subject: [Numpy-discussion] SVD does not converge > > Hello, > > I have a 380X5 matrix and when I am calculating pseudo-inverse of the > matrix > using pinv(numpy.linalg) I get the following error message: > > raise LinAlgError, 'SVD did not converge' > numpy.linalg.linalg.LinAlgError: SVD did not converge > > I have looked in the list that it is a recurring issue but I was unable to > find > any solution. Can somebody please guide me on how to fix that issue? > > Thanks > Santhosh > ============================================== > > I had a similar problem (although I wasn't looking for the pseudo inverse). > I > found that "squaring" the matrix fixed the problem. But I'm guessing in > your > situation that would mean a 380 x 380 matrix (I hope I'm thinking about > your > case correctly). But it's worth trying since it's easy to do. > -- Lou Pecora, my views are my own. > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://mail.scipy.org/pipermail/numpy-discussion/attachments/20110628/d3b4129f/attachment.html > > ------------------------------ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > End of NumPy-Discussion Digest, Vol 57, Issue 153 > ************************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From RadimRehurek at seznam.cz Tue Jun 28 14:03:05 2011 From: RadimRehurek at seznam.cz (RadimRehurek) Date: Tue, 28 Jun 2011 20:03:05 +0200 (CEST) Subject: [Numpy-discussion] NumPy-Discussion Digest, Vol 57, Issue 155 In-Reply-To: Message-ID: <27255.7797.18661-27464-1152165041-1309284185@seznam.cz> Hello, > From: santhu kumar > > After looking at the suggestions I have checked my matrix again and found > that the matrix was wrong.(Has some NAN's and other values which are wrong). > After correcting it it works fine. Not to beat a dead horse, but here's an (ancient) related numpy ticket: http://projects.scipy.org/numpy/ticket/706 It may not be your case here, but this bug bit me in the past as well (OS X with vecLib). For what it's worth, my solution was what Lou suggested (squaring the input). Best, Radim From efiring at hawaii.edu Tue Jun 28 15:41:06 2011 From: efiring at hawaii.edu (Eric Firing) Date: Tue, 28 Jun 2011 09:41:06 -1000 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0A2E52.4020806@hawaii.edu> On 06/28/2011 07:26 AM, Nathaniel Smith wrote: > On Tue, Jun 28, 2011 at 9:38 AM, Charles R Harris > wrote: >> Nathaniel, an implementation using masks will look *exactly* like an >> implementation using na-dtypes from the user's point of view. Except that >> taking a masked view of an unmasked array allows ignoring values without >> destroying or copying the original data. > > Charles, I know that :-). 
> > But if that view thing is an advertised feature -- in fact, the key > selling point for the masking-based implementation, included > specifically to make a significant contingent of users happy -- then > it's certainly user-visible. And it will make other users unhappy, > like I said. That's life. > > But who cares? My main point is that implementing a missing data > solution and a separate masked array solution is probably less work > than implementing a single everything-to-everybody solution *anyway*, > *and* it might make both sets of users happier too. Notice that in my > proposal, there's really nothing there that isn't already in Mark's > NEP in some form or another, but in my version there's almost no > overlap between the two features. That's not because I was trying to > make them artificially different; it's because I tried to think of the > most natural ways to satisfy each set of use cases, and they're just > different. I think you are exaggerating some of the differences associated with the implementation, and ignoring one *key* difference: for integer types, the masked implementation can handle the full numeric range of the type, while the bit-pattern approach cannot. Balanced against that, the *key* advantages of the bit-pattern approach would seem to be the simplicity of using a single array, particularly for IO (including memmapping) and interfacing with extension code. Although I am a heavy user of masked arrays, I consider these bit-pattern advantages to be substantial and deserving of careful consideration--perhaps of more weight and planning than they have gotten so far. Datasets on disk--e.g. climatological data, numerical model output, etc.--typically do use reserved values as missing value flags, although occasionally one also finds separate mask arrays. One of the real frustrations of the present masked array is that there is no savez/load support. I could roll my own by using a convention like saving the mask of xxx as xxx__mask__, and then reversing the process in a modified load; but I haven't gotten around to doing it. Regardless of internal implementation, I hope that core support for missing values will be included in savez/load. Eric > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From njs at pobox.com Tue Jun 28 16:09:23 2011 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 28 Jun 2011 13:09:23 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0A2E52.4020806@hawaii.edu> References: <4E0A2E52.4020806@hawaii.edu> Message-ID: On Tue, Jun 28, 2011 at 12:41 PM, Eric Firing wrote: > I think you are exaggerating some of the differences associated with the > implementation, and ignoring one *key* difference: for integer types, > the masked implementation can handle the full numeric range of the type, > while the bit-pattern approach cannot. You can get something semantically equivalent to the masked implementation by adding some extra bits and then stealing those. (That was the original "maybe(...)" idea.) My proposal would make it easy to implement either (either for us or for users, if we decide we don't want to clutter up the numpy core with too many pre-canned NA implementations). 
Doing this would give up either memory or speed versus both the separate-mask approach and the purely-bit-stealing approaches, but I don't know if anyone cares about NA support in integers *that* much -- personally I want it to be possible, because count data is important in statistics, but I don't really care how efficient it is. Floating point is much more important in practice. (Heck, R usually uses doubles for count data too -- you can't get an integer without an explicit cast.) -- Nathaniel From pgmdevlist at gmail.com Tue Jun 28 16:37:26 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Tue, 28 Jun 2011 22:37:26 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0A2E52.4020806@hawaii.edu> References: <4E0A2E52.4020806@hawaii.edu> Message-ID: On Jun 28, 2011, at 9:41 PM, Eric Firing wrote: > > One of the real frustrations of the present masked array is that there > is no savez/load support. I could roll my own by using a convention > like saving the mask of xxx as xxx__mask__, and then reversing the > process in a modified load; but I haven't gotten around to doing it. > Regardless of internal implementation, I hope that core support for > missing values will be included in savez/load. Transform your masked array into a structured one (one field for the data, one for the mask), using the .torecords method, and use it as a regular ndarray. From pgmdevlist at gmail.com Tue Jun 28 16:45:42 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Tue, 28 Jun 2011 22:45:42 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> All, I'm not sure I understand some aspects of Mark's new proposal, sorry (blame the lack of sleep). I'm pretty excited with the idea of built-in NA like np.dtype(NA['float64']), provided we can come with some shortcuts like np.nafloat64. I think that would really take care of the missing data part in a consistent and non-ambiguous way. However, I understand that if a choice would be made, this approach would be dropped for the most generic "mask way", right ? (By "mask way", I mean something that is close (but actually optimized) to thenumpy.ma approach). So, taking this example >>> np.add(a, b, out=b, mask=(a > threshold)) If 'b' doesn't already have a mask, masked values will be lost if we go the mask way ? But kept if we go the bit way ? I prefer the latter, then Another advantage I see in the "bit-way' is that it's pretty close to the 'hardmask' idea. You'll never risk to lose the mask as it's already "burned" in the array... And now for something not that completely different: * Would it be possible to store internally the addresses of the NAs only to save some space (in the metadata ?) and when the .mask or .valid property is called, to still get a boolean array with the same shape as the underlying array ? From matthew.brett at gmail.com Tue Jun 28 17:52:01 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 28 Jun 2011 22:52:01 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Tue, Jun 28, 2011 at 5:38 PM, Charles R Harris wrote: > Nathaniel, an implementation using masks will look *exactly* like an > implementation using na-dtypes from the user's point of view. Except that > taking a masked view of an unmasked array allows ignoring values without > destroying or copying the original data. 
The only downside I can see to an > implementation using masks is memory and disk storage, and perhaps memory > mapped arrays. And I rather expect the former to solve itself in a few > years, eight gigs is becoming a baseline for workstations and in a couple of > years I expect that to be up around 16-32, and a few years after that.... In > any case we are talking 12% - 25% overhead, and in practice I expect it > won't be quite as big a problem as folks project. Or, in the case of 16 bit integers, 50% memory overhead. I honestly find it hard to believe that I will not care about memory use in the near future, and I don't think it's wise to make decisions on that assumption. Best, Matthew From matthew.brett at gmail.com Tue Jun 28 17:59:51 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 28 Jun 2011 22:59:51 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0A2E52.4020806@hawaii.edu> References: <4E0A2E52.4020806@hawaii.edu> Message-ID: Hi, On Tue, Jun 28, 2011 at 8:41 PM, Eric Firing wrote: > On 06/28/2011 07:26 AM, Nathaniel Smith wrote: >> On Tue, Jun 28, 2011 at 9:38 AM, Charles R Harris >> ?wrote: >>> Nathaniel, an implementation using masks will look *exactly* like an >>> implementation using na-dtypes from the user's point of view. Except that >>> taking a masked view of an unmasked array allows ignoring values without >>> destroying or copying the original data. >> >> Charles, I know that :-). >> >> But if that view thing is an advertised feature -- in fact, the key >> selling point for the masking-based implementation, included >> specifically to make a significant contingent of users happy -- then >> it's certainly user-visible. And it will make other users unhappy, >> like I said. That's life. >> >> But who cares? My main point is that implementing a missing data >> solution and a separate masked array solution is probably less work >> than implementing a single everything-to-everybody solution *anyway*, >> *and* it might make both sets of users happier too. Notice that in my >> proposal, there's really nothing there that isn't already in Mark's >> NEP in some form or another, but in my version there's almost no >> overlap between the two features. That's not because I was trying to >> make them artificially different; it's because I tried to think of the >> most natural ways to satisfy each set of use cases, and they're just >> different. > > I think you are exaggerating some of the differences associated with the > implementation, and ignoring one *key* difference: for integer types, > the masked implementation can handle the full numeric range of the type, > while the bit-pattern approach cannot. Losing the most negative value in an int16 doesn't seem too much, but I agree losing a value in int8 might be annoying. On the other hand, maybe it's OK if we don't suport NAs for int8. See you, Matthew From matthew.brett at gmail.com Tue Jun 28 18:20:08 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 28 Jun 2011 23:20:08 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: ... > (You might think, what difference does it make if you *can* unmask an > item? Us missing data folks could just ignore this feature. But: > whatever we end up implementing is something that I will have to > explain over and over to different people, most of them not > particularly sophisticated programmers. 
And there's just no sensible > way to explain this idea that if you store some particular value, then > it replaces the old value, but if you store NA, then the old value is > still there. Ouch - yes. No question, that is difficult to explain. Well, I think the explanation might go like this: "Ah, yes, well, that's because in fact numpy records missing values by using a 'mask'. So when you say `a[3] = np.NA', what you mean is, 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" Is that fair? See you, Matthew From jason-sage at creativetrax.com Tue Jun 28 18:40:53 2011 From: jason-sage at creativetrax.com (Jason Grout) Date: Tue, 28 Jun 2011 17:40:53 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0A5875.8070803@creativetrax.com> On 6/28/11 5:20 PM, Matthew Brett wrote: > Hi, > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > ... >> (You might think, what difference does it make if you *can* unmask an >> item? Us missing data folks could just ignore this feature. But: >> whatever we end up implementing is something that I will have to >> explain over and over to different people, most of them not >> particularly sophisticated programmers. And there's just no sensible >> way to explain this idea that if you store some particular value, then >> it replaces the old value, but if you store NA, then the old value is >> still there. > > Ouch - yes. No question, that is difficult to explain. Well, I > think the explanation might go like this: > > "Ah, yes, well, that's because in fact numpy records missing values by > using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > > Is that fair? Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys the idea that the entry is still there, but we're just ignoring it. Of course, that goes against common convention, but it might be easier to explain. Thanks, Jason From matthew.brett at gmail.com Tue Jun 28 19:00:04 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 29 Jun 2011 00:00:04 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0A5875.8070803@creativetrax.com> References: <4E0A5875.8070803@creativetrax.com> Message-ID: Hi, On Tue, Jun 28, 2011 at 11:40 PM, Jason Grout wrote: > On 6/28/11 5:20 PM, Matthew Brett wrote: >> Hi, >> >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith ?wrote: >> ... >>> (You might think, what difference does it make if you *can* unmask an >>> item? Us missing data folks could just ignore this feature. But: >>> whatever we end up implementing is something that I will have to >>> explain over and over to different people, most of them not >>> particularly sophisticated programmers. And there's just no sensible >>> way to explain this idea that if you store some particular value, then >>> it replaces the old value, but if you store NA, then the old value is >>> still there. >> >> Ouch - yes. ?No question, that is difficult to explain. ? Well, I >> think the explanation might go like this: >> >> "Ah, yes, well, that's because in fact numpy records missing values by >> using a 'mask'. ? So when you say `a[3] = np.NA', what you mean is, >> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" >> >> Is that fair? > > Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys > the idea that the entry is still there, but we're just ignoring it. 
?Of > course, that goes against common convention, but it might be easier to > explain. I think Nathaniel's point is that np.IGNORE is a different idea than np.NA, and that is why joining the implementations can lead to conceptual confusion. For example, for: a = np.array([np.NA, 1]) you might expect the result of a.sum() to be np.NA. That's what it is in R. However for: b = np.array([np.IGNORE, 1]) you'd probably expect b.sum() to be 1. That's what it is for masked_array currently. The current proposal fuses these two ideas with one implementation. Quoting from the NEP: >>> a = np.array([1., 3., np.NA, 7.], masked=True) >>> np.sum(a) array(NA, dtype='>> np.sum(a, skipna=True) 11.0 I agree with Nathaniel, that there is no practical way of avoiding the full 'NAs are in fact values where theres a False in the mask' concept, and that does impose a serious conceptual cost on the 'NA' user. Best, Matthew From e.antero.tammi at gmail.com Tue Jun 28 19:00:43 2011 From: e.antero.tammi at gmail.com (eat) Date: Wed, 29 Jun 2011 02:00:43 +0300 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0A5875.8070803@creativetrax.com> References: <4E0A5875.8070803@creativetrax.com> Message-ID: Hi, On Wed, Jun 29, 2011 at 1:40 AM, Jason Grout wrote: > On 6/28/11 5:20 PM, Matthew Brett wrote: > > Hi, > > > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > > ... > >> (You might think, what difference does it make if you *can* unmask an > >> item? Us missing data folks could just ignore this feature. But: > >> whatever we end up implementing is something that I will have to > >> explain over and over to different people, most of them not > >> particularly sophisticated programmers. And there's just no sensible > >> way to explain this idea that if you store some particular value, then > >> it replaces the old value, but if you store NA, then the old value is > >> still there. > > > > Ouch - yes. No question, that is difficult to explain. Well, I > > think the explanation might go like this: > > > > "Ah, yes, well, that's because in fact numpy records missing values by > > using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > > 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > > > > Is that fair? > > Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys > the idea that the entry is still there, but we're just ignoring it. Of > course, that goes against common convention, but it might be easier to > explain. > Somehow very similar approach how I always have treated the NaNs. (Thus postponing all the real (slightly dirty) work on to the imputation procedures). For me it has been sufficient to ignore what's the actual cause of NaNs. But I believe there exists plenty other much more sophisticated situations where this kind of simple treatment is not sufficient, at all. Anyway, even in the future it should still be possible to play nicely with these kind of simple scenarios. - eat > > Thanks, > > Jason > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
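Coming back to Eric's savez/load frustration earlier in the thread: short of Pierre's torecords suggestion, the xxx/xxx__mask__ style convention he mentions takes only a few lines today. This is a sketch of the workaround, not a proposal for the file format, and the 'data'/'mask' names are arbitrary:

    import numpy as np

    a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

    # save the payload and the mask side by side under an ad-hoc convention
    np.savez('masked_example.npz',
             data=np.ma.getdata(a), mask=np.ma.getmaskarray(a))

    # ... and reassemble the masked array on load
    f = np.load('masked_example.npz')
    b = np.ma.array(f['data'], mask=f['mask'])

Core support in savez/load would make the convention unnecessary, which seems to be Eric's point.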
URL: From mwwiebe at gmail.com Tue Jun 28 19:28:38 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 18:28:38 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 10:06 AM, Nathaniel Smith wrote: > On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe wrote: > > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett > > > wrote: > >> You won't get complaints, you'll just lose a group of users, who will, > >> I suspect, stick to NaNs, unsatisfactory as they are. > > > > This blade cuts both ways, we'd lose a group of users if we don't support > > masking semantics, too. > > The problem is, that's inevitable. One might think that trying to find > a compromise solution that picks a few key aspects of each approach > would be a good way to make everyone happy, but in my experience, it > mostly leads to systems that are a muddled mess and that make everyone > unhappy. You're much better off saying screw it, these goals are in > scope and those ones aren't, and we're going to build something > consistent and powerful instead of focusing on how long the feature > list is. That's also the problem with focusing too much on a list of > use cases: you might capture everything on any single list, but there > are actually an infinite variety of use cases that will arise in the > future. If you can generalize beyond the use cases to find some simple > and consistent mental model, and implement that, then that'll work for > all those future use cases too. But sometimes that requires deciding > what *not* to implement. > > Just my opinion, but it's fairly hard won. > I don't think the solution I'm proposing is any of muddled, a mess, or a compromise. There are still rough edges to be worked out, but that's the nature of the design process. Anyway, it's pretty clear that in this particular case, there are two > distinct features that different people want: the missing data > feature, and the masked array feature. I don't believe these are different, it seems to me that people wanting masked arrays want missing data without touching their data in nearly all their use cases. If people have use cases that can't be handled with this approach, it would be nice to have specific examples the current NEP fails to address. > The more I think about it, the > less I see how they can be combined into one dessert topping + floor > wax solution. Here are three particular points where they seem to > contradict each other: > > Missing data: We think memory usage is critical. The ideal solution > has zero overhead. If we can't get that, then at the very least we > want the overhead to be 1 bit/item instead of 1 byte/item. > Masked arrays: We say, it's critical to have good ways to manipulate > the masking array, share it between multiple arrays, and so forth. And > numpy already has great support for all those things! So obviously the > masking array should be exposed as a standard ndarray. > > Missing data: Once you've assigned NA to a value, you should *not* be > able to get at what was stored there before. > Masked arrays: You must be able to unmask a value and recover what was > stored there before. > My current proposal provides both of these at the same time. It does not allow you to unmask a value without also setting the value stored in its element memory. (You might think, what difference does it make if you *can* unmask an > item? Us missing data folks could just ignore this feature. 
But: > whatever we end up implementing is something that I will have to > explain over and over to different people, most of them not > particularly sophisticated programmers. And there's just no sensible > way to explain this idea that if you store some particular value, then > it replaces the old value, but if you store NA, then the old value is > still there. They will get confused, and then store it away as another > example of how computers are arbitrary and confusing and they're just > too dumb to understand them, and I *hate* doing that to people. Plus > the more that happens, the more they end up digging themselves into > some hole by trying things at random, and then I have to dig them out > again. So the point is, we can go either way, but in both ways there > *is* a cost, and we have to decide.) > > Missing data: It's critical that NAs propagate through reduction > operations by default, though there should also be some way to turn > this off. > Masked arrays: Masked values should be silently ignored by reduction > operations, and having to remember to pass a special flag to turn on > this behavior on every single ufunc call would be a huge pain. > This isn't a difference between "missing data" and "masked arrays", you're describing the two ways of thinking about missing data in the NEP. (Masked array advocates: please correct me if I'm misrepresenting you > anywhere above!) > > > That said, Travis favors doing both, so there's a good chance there will > be > > time for it. > > One issue with the current draft is that I don't see any addressing of > how masking-missing and bit-pattern-missing interact: > a = np.zeros(10, dtype="NA[f8]") > a.flags.hasmask = True > a[5] = np.NA # Now what? > Yes, this is a rough edge that needs to be worked out. Probably the array mask should be more primal. Another question is if you add an array with a mask and an array with an "NA[]" dtype, which missing data mechanism should be produced. > If you're going to implement both things anyway, and you need to > figure out how they interact anyway, then why not split them up into > two totally separate features? > Because they're both approaches for dealing with missing data. -Mark Here's my proposal: > 1) Add a purely dtype-based support for missing data: > 1.A) Add some flags/metadata to the dtype structure to let it describe > what a missing value looks like for an element of its type. Something > like, an example NA value plus a function that can be called to > identify NAs when they occur in arrays. (Notice that this interface is > general enough to handle both the bit-stealing approach and the > maybe() approach.) > 1.B) Add an np.NA object, and teach the various coercion loops to use > the above fields in the dtype structure to handle it. > 1.C) Teach the various reduction loops that if a particular flag is > set in the dtype, then they also should check for NAs and handle them > appropriately. (If this flag is not set, then it means that this > dtype's ufunc loops are already NA aware and the generic machinery is > not needed unless skipmissing=True is given. This is useful for > user-defined dtypes, and probably also a nice optimization for floats > using NaN.) > 1.D) Finally, as a convenience, add some standard NA-aware dtypes. 
> Personally, I wouldn't bother with complicated string-based > mini-language described in the current NEP; just define some standard > NA-enabled dtype objects in the numpy namespace or provide a function > that takes a dtype + a NA bit-pattern and spits out an NA-enabled > dtype or whatever. > > 2) Add a better masked array support. > 2.A) Masked arrays are simply arrays with an extra attribute > '.visible', which is an arbitrary numpy array that is broadcastable to > the same shape as the masked array. There's no magic here -- if you > say a.visible = b.visible, then they now share a visibility array, > according to the ordinary rules of Python assignment. (Well, there > needs to be some check for shape compatibility, but that's not much > magic.) > 2.B) To minimize confusion with the missing value support, the way you > mask/unmask items is through expressions like 'a.visible[10] = False'; > there is no magic np.masked object. (There are a few options for what > happens when you try to use scalar indexing explicitly to extract an > invisible value -- you could return the actual value from behind the > mask, or throw an error, or return a scalar masked array whose > .visible attribute was a scalar array containing False. I don't know > what the people who actually use this stuff would prefer :-).) > 2.C) Indexing and shape-changing operations on the masked array are > automatically applied to the .visible array as well. (Attempting to > call .resize() on an array which is being used as the .visible > attribute of some other array is an error.) > 2.D) Ufuncs on masked arrays always ignore invisible items. We can > probably share some code here between the handling of skipmissing=True > for NA-enabled dtypes and invisible items in masked arrays, but that's > purely an implementation detail. > > This approach to masked arrays requires that the ufunc machinery have > some special knowledge of what a masked array is, so masked arrays > would have to become part of the core. I'm not sure whether or not > they should be part of the np.ndarray base class or remain as a > subclass, though. There's an argument that they're more of a > convenience feature like np.matrix, and code which interfaces between > ndarray's and C becomes more complicated if it has to be prepared to > handle visibility. (Note that in contrast, ndarray's can already > contain arbitrary user-defined dtypes, so the missing value support > proposed here doesn't add any new issues to C interfacing.) So maybe > it'd be better to leave it as a core supported subclass? Could go > either way. > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 28 19:30:17 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 18:30:17 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0A2E52.4020806@hawaii.edu> References: <4E0A2E52.4020806@hawaii.edu> Message-ID: On Tue, Jun 28, 2011 at 2:41 PM, Eric Firing wrote: > On 06/28/2011 07:26 AM, Nathaniel Smith wrote: > > On Tue, Jun 28, 2011 at 9:38 AM, Charles R Harris > > wrote: > >> Nathaniel, an implementation using masks will look *exactly* like an > >> implementation using na-dtypes from the user's point of view. 
Except > that > >> taking a masked view of an unmasked array allows ignoring values without > >> destroying or copying the original data. > > > > Charles, I know that :-). > > > > But if that view thing is an advertised feature -- in fact, the key > > selling point for the masking-based implementation, included > > specifically to make a significant contingent of users happy -- then > > it's certainly user-visible. And it will make other users unhappy, > > like I said. That's life. > > > > But who cares? My main point is that implementing a missing data > > solution and a separate masked array solution is probably less work > > than implementing a single everything-to-everybody solution *anyway*, > > *and* it might make both sets of users happier too. Notice that in my > > proposal, there's really nothing there that isn't already in Mark's > > NEP in some form or another, but in my version there's almost no > > overlap between the two features. That's not because I was trying to > > make them artificially different; it's because I tried to think of the > > most natural ways to satisfy each set of use cases, and they're just > > different. > > I think you are exaggerating some of the differences associated with the > implementation, and ignoring one *key* difference: for integer types, > the masked implementation can handle the full numeric range of the type, > while the bit-pattern approach cannot. > > Balanced against that, the *key* advantages of the bit-pattern approach > would seem to be the simplicity of using a single array, particularly > for IO (including memmapping) and interfacing with extension code. > Although I am a heavy user of masked arrays, I consider these > bit-pattern advantages to be substantial and deserving of careful > consideration--perhaps of more weight and planning than they have gotten > so far. > > Datasets on disk--e.g. climatological data, numerical model output, > etc.--typically do use reserved values as missing value flags, although > occasionally one also finds separate mask arrays. > > One of the real frustrations of the present masked array is that there > is no savez/load support. I could roll my own by using a convention > like saving the mask of xxx as xxx__mask__, and then reversing the > process in a modified load; but I haven't gotten around to doing it. > Regardless of internal implementation, I hope that core support for > missing values will be included in savez/load. > This sounds reasonable to me, and probably will require extending the file format a bit. -Mark > > Eric > > > > > > > -- Nathaniel > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 28 19:37:17 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 18:37:17 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> References: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> Message-ID: On Tue, Jun 28, 2011 at 3:45 PM, Pierre GM wrote: > All, > I'm not sure I understand some aspects of Mark's new proposal, sorry (blame > the lack of sleep). 
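For reference, the xxx__mask__ convention Eric describes in the previous message can be rolled by hand in a few lines. This is only a sketch of that convention (ma_savez and ma_load are made-up names), not the core savez/load support he is asking for:

    import numpy as np

    def ma_savez(fname, **arrays):
        # Save each masked array as its data plus a parallel boolean mask.
        flat = {}
        for name, arr in arrays.items():
            flat[name] = np.ma.getdata(arr)
            if isinstance(arr, np.ma.MaskedArray):
                flat[name + '__mask__'] = np.ma.getmaskarray(arr)
        np.savez(fname, **flat)

    def ma_load(fname):
        # Reassemble masked arrays from the name pairs; fname should end in .npz.
        npz = np.load(fname)
        out = {}
        for name in npz.files:
            if name.endswith('__mask__'):
                continue
            mask_name = name + '__mask__'
            if mask_name in npz.files:
                out[name] = np.ma.MaskedArray(npz[name], mask=npz[mask_name])
            else:
                out[name] = npz[name]
        return out

    ma_savez('data.npz', xxx=np.ma.MaskedArray([1.0, 2.0], mask=[False, True]))
    print(ma_load('data.npz')['xxx'])   # [1.0 --]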
> I'm pretty excited with the idea of built-in NA like > np.dtype(NA['float64']), provided we can come with some shortcuts like > np.nafloat64. This could be created at NumPy startup time, no problem. > I think that would really take care of the missing data part in a > consistent and non-ambiguous way. > However, I understand that if a choice would be made, this approach would > be dropped for the most generic "mask way", right ? (By "mask way", I mean > something that is close (but actually optimized) to thenumpy.ma approach). > The NEP proposes strict NA missing value semantics, where the only way to get at the masked values is by having another view that doesn't have the value masked. If someone has use cases where this prevents some functionality they need, I'd love to hear them. So, taking this example > >>> np.add(a, b, out=b, mask=(a > threshold)) > If 'b' doesn't already have a mask, masked values will be lost if we go the > mask way ? But kept if we go the bit way ? I prefer the latter, then > Another advantage I see in the "bit-way' is that it's pretty close to the > 'hardmask' idea. You'll never risk to lose the mask as it's already "burned" > in the array... > I've nearly finished this parameter, and decided to call it 'where' instead, because it is operating like an SQL where clause. Here if neither a nor b are masked array it will only modify those values of b where the 'where' parameter has the value True. And now for something not that completely different: > * Would it be possible to store internally the addresses of the NAs only to > save some space (in the metadata ?) and when the .mask or .valid property is > called, to still get a boolean array with the same shape as the underlying > array ? > Something like this could be possible, but would certainly complicate the implementation. If it were desired, it would be a follow-up feature. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 28 19:39:31 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 18:39:31 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett wrote: > Hi, > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > ... > > (You might think, what difference does it make if you *can* unmask an > > item? Us missing data folks could just ignore this feature. But: > > whatever we end up implementing is something that I will have to > > explain over and over to different people, most of them not > > particularly sophisticated programmers. And there's just no sensible > > way to explain this idea that if you store some particular value, then > > it replaces the old value, but if you store NA, then the old value is > > still there. > > Ouch - yes. No question, that is difficult to explain. Well, I > think the explanation might go like this: > > "Ah, yes, well, that's because in fact numpy records missing values by > using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > > Is that fair? > My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. 
Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. -Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 28 19:42:39 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 18:42:39 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> Message-ID: On Tue, Jun 28, 2011 at 6:00 PM, Matthew Brett wrote: > Hi, > > On Tue, Jun 28, 2011 at 11:40 PM, Jason Grout > wrote: > > On 6/28/11 5:20 PM, Matthew Brett wrote: > >> Hi, > >> > >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > >> ... > >>> (You might think, what difference does it make if you *can* unmask an > >>> item? Us missing data folks could just ignore this feature. But: > >>> whatever we end up implementing is something that I will have to > >>> explain over and over to different people, most of them not > >>> particularly sophisticated programmers. And there's just no sensible > >>> way to explain this idea that if you store some particular value, then > >>> it replaces the old value, but if you store NA, then the old value is > >>> still there. > >> > >> Ouch - yes. No question, that is difficult to explain. Well, I > >> think the explanation might go like this: > >> > >> "Ah, yes, well, that's because in fact numpy records missing values by > >> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > >> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > >> > >> Is that fair? > > > > Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys > > the idea that the entry is still there, but we're just ignoring it. Of > > course, that goes against common convention, but it might be easier to > > explain. > > I think Nathaniel's point is that np.IGNORE is a different idea than > np.NA, and that is why joining the implementations can lead to > conceptual confusion. For example, for: > > a = np.array([np.NA, 1]) > > you might expect the result of a.sum() to be np.NA. That's what it is > in R. However for: > > b = np.array([np.IGNORE, 1]) > > you'd probably expect b.sum() to be 1. That's what it is for > masked_array currently. > > The current proposal fuses these two ideas with one implementation. > Quoting from the NEP: > > >>> a = np.array([1., 3., np.NA, 7.], masked=True) > >>> np.sum(a) > array(NA, dtype=' >>> np.sum(a, skipna=True) > 11.0 > > I agree with Nathaniel, that there is no practical way of avoiding the > full 'NAs are in fact values where theres a False in the mask' > concept, and that does impose a serious conceptual cost on the 'NA' > user. > I'm not sure where the conceptual cost is coming from. If you're using missing values with the masked array implementation, all you see are missing value semantics. To see the additional masking behavior you have to deal with more than one view of the same data at the same time, something that is in and of itself already advanced for the novice user. 
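For readers trying to keep the two behaviours straight, the same contrast can be reproduced today with NaN and numpy.ma, which serve here only as rough stand-ins for the proposed np.NA and skipna semantics:

    import numpy as np

    a = np.array([1., 3., np.nan, 7.])
    print(np.sum(a))       # nan  -- propagates, like np.sum(a) returning NA above
    print(np.nansum(a))    # 11.0 -- skipped, like np.sum(a, skipna=True)

    m = np.ma.masked_invalid(a)
    print(np.sum(m))       # 11.0 -- masked entries are silently ignored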
-Mark > > Best, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Jun 28 19:55:12 2011 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 28 Jun 2011 16:55:12 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> Message-ID: On Tue, Jun 28, 2011 at 4:37 PM, Mark Wiebe wrote: > I've nearly finished this parameter, and decided to call it 'where' instead, > because it is operating like an SQL where clause. Here if neither a nor b > are masked array it will only modify those values of b where the 'where' > parameter has the value True. Probably a good idea, since in the current NEP, masks do not have a 'where' clause effect (at least for reduction operations), unless you pass skipna=True. So calling it mask= would be a bit confusing :-) So if I understand right, f(a, b, where=c) is basically an optimized (copy-avoiding) version of f(a[c], b[c]) ? Except, in this case, c must be a boolean array of exactly the right shape, you don't support broadcasting or integer arrays or slices, etc.? -- Nathaniel From pgmdevlist at gmail.com Tue Jun 28 19:56:52 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Wed, 29 Jun 2011 01:56:52 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> Message-ID: On Jun 29, 2011, at 1:37 AM, Mark Wiebe wrote: > On Tue, Jun 28, 2011 at 3:45 PM, Pierre GM wrote: > ... > > I think that would really take care of the missing data part in a consistent and non-ambiguous way. > However, I understand that if a choice would be made, this approach would be dropped for the most generic "mask way", right ? (By "mask way", I mean something that is close (but actually optimized) to thenumpy.ma approach). > > The NEP proposes strict NA missing value semantics, where the only way to get at the masked values is by having another view that doesn't have the value masked. If someone has use cases where this prevents some functionality they need, I'd love to hear them. Mmh... Would you have an example ? I haven't caught up with my lack of sleep yet... > > So, taking this example > >>> np.add(a, b, out=b, mask=(a > threshold)) > If 'b' doesn't already have a mask, masked values will be lost if we go the mask way ? But kept if we go the bit way ? I prefer the latter, then > Another advantage I see in the "bit-way' is that it's pretty close to the 'hardmask' idea. You'll never risk to lose the mask as it's already "burned" in the array... > > I've nearly finished this parameter, and decided to call it 'where' instead, because it is operating like an SQL where clause. Here if neither a nor b are masked array it will only modify those values of b where the 'where' parameter has the value True. OK, sounds fine. Pretty fine, actually. Just to be clear, if 'out' is not defined, the result is a masked array with 'where' as mask. What's the value below the mask ? np.NA ? > And now for something not that completely different: > * Would it be possible to store internally the addresses of the NAs only to save some space (in the metadata ?) and when the .mask or .valid property is called, to still get a boolean array with the same shape as the underlying array ? 
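What Pierre sketches here can be pictured in plain NumPy terms as storing only the NA positions and expanding them into a full boolean array whenever .mask or .valid is asked for. A toy illustration of that idea only, not of anything implemented:

    import numpy as np

    shape = (3, 4)
    na_flat_positions = np.array([1, 7])   # the only metadata actually stored

    def expand_mask(shape, positions):
        # Rebuild the full boolean mask on demand from the stored positions.
        m = np.zeros(int(np.prod(shape)), dtype=bool)
        m[positions] = True
        return m.reshape(shape)

    print(expand_mask(shape, na_flat_positions))

The saving only pays off when NAs are sparse, which is part of why it is treated as a possible follow-up feature in the reply that follows.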
> > Something like this could be possible, but would certainly complicate the implementation. If it were desired, it would be a follow-up feature. Oh, no problem. I was suggesting a way to save some space, but if it's too tricky to implement, forget it. From pgmdevlist at gmail.com Tue Jun 28 19:57:48 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Wed, 29 Jun 2011 01:57:48 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <73145626-86F1-43E5-96F9-5CA894E395AB@gmail.com> On Jun 29, 2011, at 1:39 AM, Mark Wiebe wrote: > On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett wrote: > Hi, > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > ... > > (You might think, what difference does it make if you *can* unmask an > > item? Us missing data folks could just ignore this feature. But: > > whatever we end up implementing is something that I will have to > > explain over and over to different people, most of them not > > particularly sophisticated programmers. And there's just no sensible > > way to explain this idea that if you store some particular value, then > > it replaces the old value, but if you store NA, then the old value is > > still there. > > Ouch - yes. No question, that is difficult to explain. Well, I > think the explanation might go like this: > > "Ah, yes, well, that's because in fact numpy records missing values by > using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > > Is that fair? > > My favorite way of explaining it would be to have a grid of numbers written on paper, then have several cardboards with holes poked in them in different configurations. Placing these cardboard masks in front of the grid would show different sets of non-missing data, without affecting the values stored on the paper behind them. And when there's a hole (or just a blank) in your piece of paper ? From mwwiebe at gmail.com Tue Jun 28 20:42:46 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 19:42:46 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> Message-ID: On Tue, Jun 28, 2011 at 6:55 PM, Nathaniel Smith wrote: > On Tue, Jun 28, 2011 at 4:37 PM, Mark Wiebe wrote: > > I've nearly finished this parameter, and decided to call it 'where' > instead, > > because it is operating like an SQL where clause. Here if neither a nor b > > are masked array it will only modify those values of b where the 'where' > > parameter has the value True. > > Probably a good idea, since in the current NEP, masks do not have a > 'where' clause effect (at least for reduction operations), unless you > pass skipna=True. So calling it mask= would be a bit confusing :-) > > So if I understand right, > f(a, b, where=c) > is basically an optimized (copy-avoiding) version of > f(a[c], b[c]) > ? Yes, of 'b[c] = f(a[c], b[c])'. > Except, in this case, c must be a boolean array of exactly the right > shape, you don't support broadcasting or integer arrays or slices, > etc.? > It supports broadcasting, but not arrays of indices. -Mark > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
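The equivalence being discussed can be spelled out concretely. The snippet below is written against the where= support that later shipped in released NumPy (1.7 and up); at the time of this thread it existed only in the pull request:

    import numpy as np

    a = np.arange(6.0)
    b = np.ones(6)
    c = np.array([True, False, True, True, False, True])

    expected = b.copy()
    expected[c] = a[c] + b[c]            # the copy-based formulation

    np.add(a, b, out=b, where=c)         # in place; where c is False, b keeps its old value
    print(np.array_equal(b, expected))   # True

Because the where argument broadcasts like any other operand, a single boolean row can also be applied across every row of a 2-D output, which is the broadcasting referred to above.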
URL: From mwwiebe at gmail.com Tue Jun 28 20:47:51 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 19:47:51 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <7E888D0E-0422-474B-BACD-B3B1A5C51CBA@gmail.com> Message-ID: On Tue, Jun 28, 2011 at 6:56 PM, Pierre GM wrote: > > On Jun 29, 2011, at 1:37 AM, Mark Wiebe wrote: > > > On Tue, Jun 28, 2011 at 3:45 PM, Pierre GM wrote: > > ... > > > > I think that would really take care of the missing data part in a > consistent and non-ambiguous way. > > However, I understand that if a choice would be made, this approach would > be dropped for the most generic "mask way", right ? (By "mask way", I mean > something that is close (but actually optimized) to thenumpy.ma approach). > > > > The NEP proposes strict NA missing value semantics, where the only way to > get at the masked values is by having another view that doesn't have the > value masked. If someone has use cases where this prevents some > functionality they need, I'd love to hear them. > > Mmh... Would you have an example ? I haven't caught up with my lack of > sleep yet... Sure, I'll copy one I made for the NEP for starters: >>> a = np.array([1,2]) >>> b = a.view() >>> b.flags.hasmask = True >>> b array([1,2], masked=True) >>> b[0] = np.NA >>> b array([NA,2], masked=True) >>> a array([1,2]) >>> # The underlying number 1 value in 'a[0]' was untouched At this point, there is no way to access the number 1 value using 'b', just using 'a'. If we assign to 'b[0]' it will also change a[0]: >>> b[0] = 3 >>> a array([3,2]) > > > > So, taking this example > > >>> np.add(a, b, out=b, mask=(a > threshold)) > > If 'b' doesn't already have a mask, masked values will be lost if we go > the mask way ? But kept if we go the bit way ? I prefer the latter, then > > Another advantage I see in the "bit-way' is that it's pretty close to the > 'hardmask' idea. You'll never risk to lose the mask as it's already "burned" > in the array... > > > > I've nearly finished this parameter, and decided to call it 'where' > instead, because it is operating like an SQL where clause. Here if neither a > nor b are masked array it will only modify those values of b where the > 'where' parameter has the value True. > > OK, sounds fine. Pretty fine, actually. Just to be clear, if 'out' is not > defined, the result is a masked array with 'where' as mask. What's the value > below the mask ? np.NA ? > The value below the mask is like the result of np.empty(), and with strict missing value semantics, it shouldn't be possible to ever get at it. (except for breaking the rules from C code). > And now for something not that completely different: > > * Would it be possible to store internally the addresses of the NAs only > to save some space (in the metadata ?) and when the .mask or .valid property > is called, to still get a boolean array with the same shape as the > underlying array ? > > > > Something like this could be possible, but would certainly complicate the > implementation. If it were desired, it would be a follow-up feature. > > Oh, no problem. I was suggesting a way to save some space, but if it's too > tricky to implement, forget it. > Cool, I think the semantics with views of masks might not work either. With a bit-level mask the views would still be possible but more complicated than normal views. 
-Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Tue Jun 28 20:49:40 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 28 Jun 2011 19:49:40 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <73145626-86F1-43E5-96F9-5CA894E395AB@gmail.com> References: <73145626-86F1-43E5-96F9-5CA894E395AB@gmail.com> Message-ID: On Tue, Jun 28, 2011 at 6:57 PM, Pierre GM wrote: > > On Jun 29, 2011, at 1:39 AM, Mark Wiebe wrote: > > > On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett > wrote: > > Hi, > > > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > > ... > > > (You might think, what difference does it make if you *can* unmask an > > > item? Us missing data folks could just ignore this feature. But: > > > whatever we end up implementing is something that I will have to > > > explain over and over to different people, most of them not > > > particularly sophisticated programmers. And there's just no sensible > > > way to explain this idea that if you store some particular value, then > > > it replaces the old value, but if you store NA, then the old value is > > > still there. > > > > Ouch - yes. No question, that is difficult to explain. Well, I > > think the explanation might go like this: > > > > "Ah, yes, well, that's because in fact numpy records missing values by > > using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > > 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > > > > Is that fair? > > > > My favorite way of explaining it would be to have a grid of numbers > written on paper, then have several cardboards with holes poked in them in > different configurations. Placing these cardboard masks in front of the grid > would show different sets of non-missing data, without affecting the values > stored on the paper behind them. > > And when there's a hole (or just a blank) in your piece of paper ? A hole means an unmasked element, no hole means a masked element. -Mark > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.s.seljebotn at astro.uio.no Wed Jun 29 02:53:43 2011 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 29 Jun 2011 08:53:43 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0ACBF7.2080400@astro.uio.no> On 06/28/2011 11:52 PM, Matthew Brett wrote: > Hi, > > On Tue, Jun 28, 2011 at 5:38 PM, Charles R Harris > wrote: >> Nathaniel, an implementation using masks will look *exactly* like an >> implementation using na-dtypes from the user's point of view. Except that >> taking a masked view of an unmasked array allows ignoring values without >> destroying or copying the original data. The only downside I can see to an >> implementation using masks is memory and disk storage, and perhaps memory >> mapped arrays. And I rather expect the former to solve itself in a few >> years, eight gigs is becoming a baseline for workstations and in a couple of >> years I expect that to be up around 16-32, and a few years after that.... 
In >> any case we are talking 12% - 25% overhead, and in practice I expect it >> won't be quite as big a problem as folks project. > > Or, in the case of 16 bit integers, 50% memory overhead. > > I honestly find it hard to believe that I will not care about memory > use in the near future, and I don't think it's wise to make decisions > on that assumption. In many sciences, waiting for the future makes things worse, not better, simply because the amount of available data easily grows at a faster rate than the amount of memory you can get per dollar :-) Dag Sverre From d.s.seljebotn at astro.uio.no Wed Jun 29 03:26:03 2011 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 29 Jun 2011 09:26:03 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0AD38B.5030404@astro.uio.no> On 06/27/2011 05:55 PM, Mark Wiebe wrote: > First I'd like to thank everyone for all the feedback you're providing, > clearly this is an important topic to many people, and the discussion > has helped clarify the ideas for me. I've renamed and updated the NEP, > then placed it into the master NumPy repository so it has a more > permanent home here: > > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst One thing to think about is the presence of SSE/AVX instructions, which has the potential to change some of the memory/speed trade-offs here. In the newest Intel-platform CPUs you can do 256-bit operations, translating to a theoretical factor 8 speedup for in-cache single precision data, and the instruction set is constructed for future expansion possibilites to 512 or 1024 bit registers. I feel one should take care to not design oneself into a corner where this can't (eventually) be leveraged. 1) The shuffle instructions takes a single byte as a control character for moving around data in different ways in 128-bit registers. One could probably implement fast IGNORE-style NA with a seperate mask using 1 byte per 16 bytes of data (with 4 or 8-byte elements). OTOH, I'm not sure if 1 byte per element kind of mask would be that fast (but I don't know much about this and haven't looked at the details). 2) The alternative "Parameterized Data Type Which Adds Additional Memory for the NA Flag" would mean that contiguous arrays with NA's/IGNORE's would not be subject to vector instructions, or create a mess of copying in and out prior to operating on the data. This really seems like the worst of all possibilites to me. (FWIW, my vote is in favour of both NA-using-NaN and IGNORE-using-explicit-masks, and keep the two as entirely seperate worlds to avoid confusion.) Dag Sverre From xscript at gmx.net Wed Jun 29 09:20:57 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Wed, 29 Jun 2011 15:20:57 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: (Matthew Brett's message of "Wed, 29 Jun 2011 00:00:04 +0100") References: <4E0A5875.8070803@creativetrax.com> Message-ID: <87oc1gg4ee.fsf@ginnungagap.bsc.es> Matthew Brett writes: >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys >> the idea that the entry is still there, but we're just ignoring it. ?Of >> course, that goes against common convention, but it might be easier to >> explain. > I think Nathaniel's point is that np.IGNORE is a different idea than > np.NA, and that is why joining the implementations can lead to > conceptual confusion. 
This is how I see it: >>> a = np.array([0, 1, 2], dtype=int) >>> a[0] = np.NA ValueError >>> e = np.array([np.NA, 1, 2], dtype=int) ValueError >>> b = np.array([np.NA, 1, 2], dtype=np.maybe(int)) >>> m = np.array([np.NA, 1, 2], dtype=int, masked=True) >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) >>> b[1] = np.NA >>> np.sum(b) np.NA >>> np.sum(b, skipna=True) 2 >>> b.mask None >>> m[1] = np.NA >>> np.sum(m) 2 >>> np.sum(m, skipna=True) 2 >>> m.mask [False, False, True] >>> bm[1] = np.NA >>> np.sum(bm) 2 >>> np.sum(bm, skipna=True) 2 >>> bm.mask [False, False, True] So: * Mask takes precedence over bit pattern on element assignment. There's still the question of how to assign a bit pattern NA when the mask is active. * When using mask, elements are automagically skipped. * "m[1] = np.NA" is equivalent to "m.mask[1] = False" * When using bit pattern + mask, it might make sense to have the initial values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, False, True]" and "np.sum(bm) == np.NA") Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From matthew.brett at gmail.com Wed Jun 29 09:45:13 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 29 Jun 2011 14:45:13 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe wrote: > On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett > wrote: >> >> Hi, >> >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: >> ... >> > (You might think, what difference does it make if you *can* unmask an >> > item? Us missing data folks could just ignore this feature. But: >> > whatever we end up implementing is something that I will have to >> > explain over and over to different people, most of them not >> > particularly sophisticated programmers. And there's just no sensible >> > way to explain this idea that if you store some particular value, then >> > it replaces the old value, but if you store NA, then the old value is >> > still there. >> >> Ouch - yes. ?No question, that is difficult to explain. ? Well, I >> think the explanation might go like this: >> >> "Ah, yes, well, that's because in fact numpy records missing values by >> using a 'mask'. ? So when you say `a[3] = np.NA', what you mean is, >> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" >> >> Is that fair? > > My favorite way of explaining it would be to have a grid of numbers written > on paper, then have several cardboards with holes poked in them in different > configurations. Placing these cardboard masks in front of the grid would > show different sets of non-missing data, without affecting the values stored > on the paper behind them. Right - but here of course you are trying to explain the mask, and this is Nathaniel's point, that in order to explain NAs, you have to explain masks, and so, even at a basic level, the fusion of the two ideas is obvious, and already confusing. I mean this: a[3] = np.NA "Oh, so you just set the a[3] value to have some missing value code?" "Ah - no - in fact what I did was set a associated mask in position a[3] so that you can't any longer see the previous value of a[3]" "Huh. You mean I have a mask for every single value in order to be able to blank out a[3]? It looks like an assignment. I mean, it looks just like a[3] = 4. 
But I guess it isn't?" "Er..." I think Nathaniel's point is a very good one - these are separate ideas, np.NA and np.IGNORE, and a joint implementation is bound to draw them together in the mind of the user. Apart from anything else, the user has to know that, if they want a single NA value in an array, they have to add a mask size array.shape in bytes. They have to know then, that NA is implemented by masking, and then the 'NA for free by adding masking' idea breaks down and starts to feel like a kludge. The counter argument is of course that, in time, the implementation of NA with masking will seem as obvious and intuitive, as, say, broadcasting, and that we are just reacting from lack of experience with the new API. Of course, that does happen, but here, unless I am mistaken, the primary drive to fuse NA and masking is because of ease of implementation. That doesn't necessarily mean that they don't go together - if something is easy to implement, sometimes it means it will also feel natural in use, but at least we might say that there is some risk of the implementation driving the API, and that that can lead to problems. See you, Matthew From mlist at re-factory.de Wed Jun 29 10:32:14 2011 From: mlist at re-factory.de (Robert Elsner) Date: Wed, 29 Jun 2011 16:32:14 +0200 Subject: [Numpy-discussion] Multiply along axis Message-ID: <4E0B376E.6010706@re-factory.de> Hello everyone, I would like to solve the following problem (preferably without reshaping / flipping the array a). Assume I have a vector v of length x and an n-dimensional array a where one dimension has length x as well. Now I would like to multiply the vector v along a given axis of a. Some example code a = np.random.random((2,3)) x = np.zeros(2) a * x # Fails because not broadcastable So how do I multiply x with the columns of a so that for each column j a[:,j] = a[:,j] * x without using a loop. Is there some (fast) fast way to accomplish that with numpy/scipy? Thanks for your help Robert Elsner From d.s.seljebotn at astro.uio.no Wed Jun 29 10:35:14 2011 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 29 Jun 2011 16:35:14 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0B3822.5010908@astro.uio.no> On 06/29/2011 03:45 PM, Matthew Brett wrote: > Hi, > > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe wrote: >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett >> wrote: >>> >>> Hi, >>> >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: >>> ... >>>> (You might think, what difference does it make if you *can* unmask an >>>> item? Us missing data folks could just ignore this feature. But: >>>> whatever we end up implementing is something that I will have to >>>> explain over and over to different people, most of them not >>>> particularly sophisticated programmers. And there's just no sensible >>>> way to explain this idea that if you store some particular value, then >>>> it replaces the old value, but if you store NA, then the old value is >>>> still there. >>> >>> Ouch - yes. No question, that is difficult to explain. Well, I >>> think the explanation might go like this: >>> >>> "Ah, yes, well, that's because in fact numpy records missing values by >>> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, >>> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" >>> >>> Is that fair? 
>> >> My favorite way of explaining it would be to have a grid of numbers written >> on paper, then have several cardboards with holes poked in them in different >> configurations. Placing these cardboard masks in front of the grid would >> show different sets of non-missing data, without affecting the values stored >> on the paper behind them. > > Right - but here of course you are trying to explain the mask, and > this is Nathaniel's point, that in order to explain NAs, you have to > explain masks, and so, even at a basic level, the fusion of the two > ideas is obvious, and already confusing. I mean this: > > a[3] = np.NA > > "Oh, so you just set the a[3] value to have some missing value code?" > > "Ah - no - in fact what I did was set a associated mask in position > a[3] so that you can't any longer see the previous value of a[3]" > > "Huh. You mean I have a mask for every single value in order to be > able to blank out a[3]? It looks like an assignment. I mean, it > looks just like a[3] = 4. But I guess it isn't?" > > "Er..." > > I think Nathaniel's point is a very good one - these are separate > ideas, np.NA and np.IGNORE, and a joint implementation is bound to > draw them together in the mind of the user. Apart from anything > else, the user has to know that, if they want a single NA value in an > array, they have to add a mask size array.shape in bytes. They have > to know then, that NA is implemented by masking, and then the 'NA for > free by adding masking' idea breaks down and starts to feel like a > kludge. > > The counter argument is of course that, in time, the implementation of > NA with masking will seem as obvious and intuitive, as, say, > broadcasting, and that we are just reacting from lack of experience > with the new API. However, no matter how used we get to this, people coming from almost any other tool (in particular R) will keep think it is counter-intuitive. Why set up a major semantic incompatability that people then have to overcome in order to start using NumPy. I really don't see what's wrong with some more explicit API like a.mask[3] = True. "Explicit is better than implicit". Dag Sverre From mlist at re-factory.de Wed Jun 29 10:35:59 2011 From: mlist at re-factory.de (Robert Elsner) Date: Wed, 29 Jun 2011 16:35:59 +0200 Subject: [Numpy-discussion] Multiply along axis In-Reply-To: <4E0B376E.6010706@re-factory.de> References: <4E0B376E.6010706@re-factory.de> Message-ID: <4E0B384F.9030702@re-factory.de> Oh and I forgot to mention: I want to specify the axis so that it is possible to multiply x along an arbitrary axis of a (given that the lengths match). On 29.06.2011 16:32, Robert Elsner wrote: > Hello everyone, > > I would like to solve the following problem (preferably without > reshaping / flipping the array a). > > Assume I have a vector v of length x and an n-dimensional array a where > one dimension has length x as well. Now I would like to multiply the > vector v along a given axis of a. > > Some example code > > a = np.random.random((2,3)) > x = np.zeros(2) > > a * x # Fails because not broadcastable > > > So how do I multiply x with the columns of a so that for each column j > a[:,j] = a[:,j] * x > > without using a loop. Is there some (fast) fast way to accomplish that > with numpy/scipy? 
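A general way to do this for an arbitrary axis, offered only as one possible sketch (the x[:, None] answer further down is the special case axis=0), is to give x a shape of ones everywhere except along the chosen axis:

    import numpy as np

    def mult_along_axis(a, x, axis):
        # Reshape x so it broadcasts along the requested axis of a.
        shape = [1] * a.ndim
        shape[axis] = x.size
        return a * x.reshape(shape)

    a = np.random.random((2, 3))
    x = np.array([10.0, 20.0])
    print(np.allclose(mult_along_axis(a, x, 0), a * x[:, None]))   # True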
> > Thanks for your help > Robert Elsner > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jsseabold at gmail.com Wed Jun 29 10:38:38 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Wed, 29 Jun 2011 10:38:38 -0400 Subject: [Numpy-discussion] Multiply along axis In-Reply-To: <4E0B376E.6010706@re-factory.de> References: <4E0B376E.6010706@re-factory.de> Message-ID: On Wed, Jun 29, 2011 at 10:32 AM, Robert Elsner wrote: > > Hello everyone, > > I would like to solve the following problem (preferably without > reshaping / flipping the array a). > > Assume I have a vector v of length x and an n-dimensional array a where > one dimension has length x as well. Now I would like to multiply the > vector v along a given axis of a. > > Some example code > > a = np.random.random((2,3)) > x = np.zeros(2) > > a * x ? # Fails because not broadcastable > > > So how do I multiply x with the columns of a so that for each column j > a[:,j] = a[:,j] * x > > without using a loop. Is there some (fast) fast way to accomplish that > with numpy/scipy? > Is this what you want? a * x[:,None] or a * x[:,np.newaxis] or more generally a * np.expand_dims(x,1) Skipper From mlist at re-factory.de Wed Jun 29 11:05:20 2011 From: mlist at re-factory.de (Robert Elsner) Date: Wed, 29 Jun 2011 17:05:20 +0200 Subject: [Numpy-discussion] Multiply along axis In-Reply-To: References: <4E0B376E.6010706@re-factory.de> Message-ID: <4E0B3F30.2010700@re-factory.de> Yeah great that was spot-on. And I thought I knew most of the slicing tricks. I combined it with a slice object so that idx_obj = [ None for i in xrange(a.ndim) ] idx_obj[axis] = slice(None) a * x[idx_object] works the way I want it. Suggestions are welcome but I am happy with the quick solution you pointed out. Thanks On 29.06.2011 16:38, Skipper Seabold wrote: > On Wed, Jun 29, 2011 at 10:32 AM, Robert Elsner wrote: >> Hello everyone, >> >> I would like to solve the following problem (preferably without >> reshaping / flipping the array a). >> >> Assume I have a vector v of length x and an n-dimensional array a where >> one dimension has length x as well. Now I would like to multiply the >> vector v along a given axis of a. >> >> Some example code >> >> a = np.random.random((2,3)) >> x = np.zeros(2) >> >> a * x # Fails because not broadcastable >> >> >> So how do I multiply x with the columns of a so that for each column j >> a[:,j] = a[:,j] * x >> >> without using a loop. Is there some (fast) fast way to accomplish that >> with numpy/scipy? >> > Is this what you want? > > a * x[:,None] > > or > > a * x[:,np.newaxis] > > or more generally > > a * np.expand_dims(x,1) > > Skipper > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From pgmdevlist at gmail.com Wed Jun 29 13:05:22 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Wed, 29 Jun 2011 19:05:22 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0B3822.5010908@astro.uio.no> References: <4E0B3822.5010908@astro.uio.no> Message-ID: Matthew, Dag, +1. 
On Jun 29, 2011 4:35 PM, "Dag Sverre Seljebotn" wrote: > On 06/29/2011 03:45 PM, Matthew Brett wrote: >> Hi, >> >> On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe wrote: >>> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett >>> wrote: >>>> >>>> Hi, >>>> >>>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: >>>> ... >>>>> (You might think, what difference does it make if you *can* unmask an >>>>> item? Us missing data folks could just ignore this feature. But: >>>>> whatever we end up implementing is something that I will have to >>>>> explain over and over to different people, most of them not >>>>> particularly sophisticated programmers. And there's just no sensible >>>>> way to explain this idea that if you store some particular value, then >>>>> it replaces the old value, but if you store NA, then the old value is >>>>> still there. >>>> >>>> Ouch - yes. No question, that is difficult to explain. Well, I >>>> think the explanation might go like this: >>>> >>>> "Ah, yes, well, that's because in fact numpy records missing values by >>>> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, >>>> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" >>>> >>>> Is that fair? >>> >>> My favorite way of explaining it would be to have a grid of numbers written >>> on paper, then have several cardboards with holes poked in them in different >>> configurations. Placing these cardboard masks in front of the grid would >>> show different sets of non-missing data, without affecting the values stored >>> on the paper behind them. >> >> Right - but here of course you are trying to explain the mask, and >> this is Nathaniel's point, that in order to explain NAs, you have to >> explain masks, and so, even at a basic level, the fusion of the two >> ideas is obvious, and already confusing. I mean this: >> >> a[3] = np.NA >> >> "Oh, so you just set the a[3] value to have some missing value code?" >> >> "Ah - no - in fact what I did was set a associated mask in position >> a[3] so that you can't any longer see the previous value of a[3]" >> >> "Huh. You mean I have a mask for every single value in order to be >> able to blank out a[3]? It looks like an assignment. I mean, it >> looks just like a[3] = 4. But I guess it isn't?" >> >> "Er..." >> >> I think Nathaniel's point is a very good one - these are separate >> ideas, np.NA and np.IGNORE, and a joint implementation is bound to >> draw them together in the mind of the user. Apart from anything >> else, the user has to know that, if they want a single NA value in an >> array, they have to add a mask size array.shape in bytes. They have >> to know then, that NA is implemented by masking, and then the 'NA for >> free by adding masking' idea breaks down and starts to feel like a >> kludge. >> >> The counter argument is of course that, in time, the implementation of >> NA with masking will seem as obvious and intuitive, as, say, >> broadcasting, and that we are just reacting from lack of experience >> with the new API. > > However, no matter how used we get to this, people coming from almost > any other tool (in particular R) will keep think it is > counter-intuitive. Why set up a major semantic incompatability that > people then have to overcome in order to start using NumPy. > > I really don't see what's wrong with some more explicit API like > a.mask[3] = True. "Explicit is better than implicit". 
> > Dag Sverre > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:07:12 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:07:12 -0500 Subject: [Numpy-discussion] pull request for testing/review: ufunc 'where=' parameter Message-ID: https://github.com/numpy/numpy/pull/99 In [24]: a = np.ones((1000,1000)) In [25]: b = np.ones((1000,1000)) In [26]: c = np.zeros((1000, 1000)) In [27]: m = np.random.rand(1000,1000) > 0.5 In [28]: timeit c[m] = a[m] + b[m] 1 loops, best of 3: 246 ms per loop In [29]: timeit np.add(a, b, out=c, where=m) 10 loops, best of 3: 20.3 ms per loop -Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:14:01 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:14:01 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0AD38B.5030404@astro.uio.no> References: <4E0AD38B.5030404@astro.uio.no> Message-ID: On Wed, Jun 29, 2011 at 2:26 AM, Dag Sverre Seljebotn < d.s.seljebotn at astro.uio.no> wrote: > On 06/27/2011 05:55 PM, Mark Wiebe wrote: > > First I'd like to thank everyone for all the feedback you're providing, > > clearly this is an important topic to many people, and the discussion > > has helped clarify the ideas for me. I've renamed and updated the NEP, > > then placed it into the master NumPy repository so it has a more > > permanent home here: > > > > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst > > One thing to think about is the presence of SSE/AVX instructions, which > has the potential to change some of the memory/speed trade-offs here. > > In the newest Intel-platform CPUs you can do 256-bit operations, > translating to a theoretical factor 8 speedup for in-cache single > precision data, and the instruction set is constructed for future > expansion possibilites to 512 or 1024 bit registers. > The ufuncs themselves need a good bit of refactoring to be able to use these kinds of instructions well. I'm definitely thinking about this kind of thing while designing/implementing. > I feel one should take care to not design oneself into a corner where > this can't (eventually) be leveraged. > > 1) The shuffle instructions takes a single byte as a control character > for moving around data in different ways in 128-bit registers. One could > probably implement fast IGNORE-style NA with a seperate mask using 1 > byte per 16 bytes of data (with 4 or 8-byte elements). OTOH, I'm not > sure if 1 byte per element kind of mask would be that fast (but I don't > know much about this and haven't looked at the details). > This level of optimization, while important, is often dwarfed by the effects of cache. Because of the complexity of the system demanded by the functionality, I'm trying to favor simplicity and generality without precluding high performance. 2) The alternative "Parameterized Data Type Which Adds Additional Memory > for the NA Flag" would mean that contiguous arrays with NA's/IGNORE's > would not be subject to vector instructions, or create a mess of copying > in and out prior to operating on the data. This really seems like the > worst of all possibilites to me. > This one was suggested on the list, so I added it. 
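One way to picture the "additional memory for the NA flag" variant being criticised here is a structured dtype that carries a flag byte next to each value. This is just an illustration of that memory layout, not the layout the NEP actually specifies:

    import numpy as np

    naflag_f8 = np.dtype([('value', 'f8'), ('isna', '?')])   # 9 bytes per element when packed
    a = np.zeros(4, dtype=naflag_f8)
    a['value'] = [1.0, 2.0, 3.0, 4.0]
    a['isna'][2] = True
    print(a['value'][~a['isna']].sum())   # 7.0 -- reductions have to skip the flagged slots

Because the flags are interleaved with the values, the payload is no longer a plain contiguous float64 buffer, which is what the point about vector instructions comes down to.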
-Mark > (FWIW, my vote is in favour of both NA-using-NaN and > IGNORE-using-explicit-masks, and keep the two as entirely seperate > worlds to avoid confusion.) > Dag Sverre > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:22:59 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:22:59 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <87oc1gg4ee.fsf@ginnungagap.bsc.es> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 8:20 AM, Llu?s wrote: > Matthew Brett writes: > > >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys > >> the idea that the entry is still there, but we're just ignoring it. Of > >> course, that goes against common convention, but it might be easier to > >> explain. > > > I think Nathaniel's point is that np.IGNORE is a different idea than > > np.NA, and that is why joining the implementations can lead to > > conceptual confusion. > > This is how I see it: > > >>> a = np.array([0, 1, 2], dtype=int) > >>> a[0] = np.NA > ValueError > >>> e = np.array([np.NA, 1, 2], dtype=int) > ValueError > >>> b = np.array([np.NA, 1, 2], dtype=np.maybe(int)) > >>> m = np.array([np.NA, 1, 2], dtype=int, masked=True) > >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) > >>> b[1] = np.NA > >>> np.sum(b) > np.NA > >>> np.sum(b, skipna=True) > 2 > >>> b.mask > None > >>> m[1] = np.NA > >>> np.sum(m) > 2 > >>> np.sum(m, skipna=True) > 2 > >>> m.mask > [False, False, True] > >>> bm[1] = np.NA > >>> np.sum(bm) > 2 > >>> np.sum(bm, skipna=True) > 2 > >>> bm.mask > [False, False, True] > > So: > > * Mask takes precedence over bit pattern on element assignment. There's > still the question of how to assign a bit pattern NA when the mask is > active. > > * When using mask, elements are automagically skipped. > > * "m[1] = np.NA" is equivalent to "m.mask[1] = False" > > * When using bit pattern + mask, it might make sense to have the initial > values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, > False, True]" and "np.sum(bm) == np.NA") > There seems to be a general idea that masks and NA bit patterns imply particular differing semantics, something which I think is simply false. Both NaN and Inf are implemented in hardware with the same idea as the NA bit pattern, but they do not follow NA missing value semantics. As far as I can tell, the only required difference between them is that NA bit patterns must destroy the data. Nothing else. Everything on top of that is a choice of API and interface mechanisms. I want them to behave exactly the same except for that necessary difference, so that it will be possible to use the *exact same Python code* with either approach. Say you're using NA dtypes, and suddenly you think, "what if I temporarily treated these as NA too". Now you have to copy your whole array to avoid destroying your data! The NA bit pattern didn't save you memory here... Say you're using masks, and it turns out you didn't actually need masking semantics. If they're different, you now have to do lots of code changes to switch to NA dtypes! 
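The point above that a hardware bit pattern does not by itself dictate missing-value semantics is easy to see with NaN in an interpreter:

    import numpy as np

    print(np.nan == np.nan)               # False -- an NA comparison would be NA, not False
    print(np.array([np.nan, 1.0]).sum())  # nan   -- here propagation happens to match NA
    print(np.nan < 1.0)                   # False -- ordering comparisons just answer False

Which behaviour NaN shows in each case is an accident of IEEE 754; the missing-value semantics still have to be chosen and layered on top, whichever storage mechanism is used.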
-Mark > > Lluis > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:34:48 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:34:48 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: On Wed, Jun 29, 2011 at 8:45 AM, Matthew Brett wrote: > Hi, > > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe wrote: > > On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > >> ... > >> > (You might think, what difference does it make if you *can* unmask an > >> > item? Us missing data folks could just ignore this feature. But: > >> > whatever we end up implementing is something that I will have to > >> > explain over and over to different people, most of them not > >> > particularly sophisticated programmers. And there's just no sensible > >> > way to explain this idea that if you store some particular value, then > >> > it replaces the old value, but if you store NA, then the old value is > >> > still there. > >> > >> Ouch - yes. No question, that is difficult to explain. Well, I > >> think the explanation might go like this: > >> > >> "Ah, yes, well, that's because in fact numpy records missing values by > >> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > >> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > >> > >> Is that fair? > > > > My favorite way of explaining it would be to have a grid of numbers > written > > on paper, then have several cardboards with holes poked in them in > different > > configurations. Placing these cardboard masks in front of the grid would > > show different sets of non-missing data, without affecting the values > stored > > on the paper behind them. > > Right - but here of course you are trying to explain the mask, and > this is Nathaniel's point, that in order to explain NAs, you have to > explain masks, and so, even at a basic level, the fusion of the two > ideas is obvious, and already confusing. I mean this: > > a[3] = np.NA > > "Oh, so you just set the a[3] value to have some missing value code?" > I would answer "Yes, that's basically true." The abstraction works that way, and there's no reason to confuse people with those implementation details right off the bat. When you introduce a new user to floating point numbers, it would seem odd to first point out that addition isn't associative. That kind of detail is important when you're learning more about the system and digging deeper. I think it was in a Knuth book that I read the idea that the best teaching is a series of lies that successively correct the previous lies. > "Ah - no - in fact what I did was set a associated mask in position > a[3] so that you can't any longer see the previous value of a[3]" > > "Huh. You mean I have a mask for every single value in order to be > able to blank out a[3]? It looks like an assignment. I mean, it > looks just like a[3] = 4. But I guess it isn't?" > > "Er..." 
> > I think Nathaniel's point is a very good one - these are separate > ideas, np.NA and np.IGNORE, and a joint implementation is bound to > draw them together in the mind of the user. R jointly implements them with the rm.na=T parameter, and that's our model system for missing data. > Apart from anything > else, the user has to know that, if they want a single NA value in an > array, they have to add a mask size array.shape in bytes. They have > to know then, that NA is implemented by masking, and then the 'NA for > free by adding masking' idea breaks down and starts to feel like a > kludge. > > The counter argument is of course that, in time, the implementation of > NA with masking will seem as obvious and intuitive, as, say, > broadcasting, and that we are just reacting from lack of experience > with the new API. > It will literally work the same as the implementation with NA dtypes, except for the masking semantics which requires the extra steps of taking views. > > Of course, that does happen, but here, unless I am mistaken, the > primary drive to fuse NA and masking is because of ease of > implementation. That's not the case, and I've tried to give a slightly better justification for this in my answer Lluis' email. > That doesn't necessarily mean that they don't go > together - if something is easy to implement, sometimes it means it > will also feel natural in use, but at least we might say that there is > some risk of the implementation driving the API, and that that can > lead to problems. > In the design process I'm doing, the implementation concerns are affecting the interface concerns and vice versa, but the missing data semantics are the main driver. -Mark > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:38:36 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:38:36 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0B3822.5010908@astro.uio.no> References: <4E0B3822.5010908@astro.uio.no> Message-ID: On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn < d.s.seljebotn at astro.uio.no> wrote: > On 06/29/2011 03:45 PM, Matthew Brett wrote: > > Hi, > > > > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe wrote: > >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett > >> wrote: > >>> > >>> Hi, > >>> > >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith > wrote: > >>> ... > >>>> (You might think, what difference does it make if you *can* unmask an > >>>> item? Us missing data folks could just ignore this feature. But: > >>>> whatever we end up implementing is something that I will have to > >>>> explain over and over to different people, most of them not > >>>> particularly sophisticated programmers. And there's just no sensible > >>>> way to explain this idea that if you store some particular value, then > >>>> it replaces the old value, but if you store NA, then the old value is > >>>> still there. > >>> > >>> Ouch - yes. No question, that is difficult to explain. Well, I > >>> think the explanation might go like this: > >>> > >>> "Ah, yes, well, that's because in fact numpy records missing values by > >>> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > >>> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > >>> > >>> Is that fair? 
> >> > >> My favorite way of explaining it would be to have a grid of numbers > written > >> on paper, then have several cardboards with holes poked in them in > different > >> configurations. Placing these cardboard masks in front of the grid would > >> show different sets of non-missing data, without affecting the values > stored > >> on the paper behind them. > > > > Right - but here of course you are trying to explain the mask, and > > this is Nathaniel's point, that in order to explain NAs, you have to > > explain masks, and so, even at a basic level, the fusion of the two > > ideas is obvious, and already confusing. I mean this: > > > > a[3] = np.NA > > > > "Oh, so you just set the a[3] value to have some missing value code?" > > > > "Ah - no - in fact what I did was set a associated mask in position > > a[3] so that you can't any longer see the previous value of a[3]" > > > > "Huh. You mean I have a mask for every single value in order to be > > able to blank out a[3]? It looks like an assignment. I mean, it > > looks just like a[3] = 4. But I guess it isn't?" > > > > "Er..." > > > > I think Nathaniel's point is a very good one - these are separate > > ideas, np.NA and np.IGNORE, and a joint implementation is bound to > > draw them together in the mind of the user. Apart from anything > > else, the user has to know that, if they want a single NA value in an > > array, they have to add a mask size array.shape in bytes. They have > > to know then, that NA is implemented by masking, and then the 'NA for > > free by adding masking' idea breaks down and starts to feel like a > > kludge. > > > > The counter argument is of course that, in time, the implementation of > > NA with masking will seem as obvious and intuitive, as, say, > > broadcasting, and that we are just reacting from lack of experience > > with the new API. > > However, no matter how used we get to this, people coming from almost > any other tool (in particular R) will keep think it is > counter-intuitive. Why set up a major semantic incompatability that > people then have to overcome in order to start using NumPy. > I'm not aware of a semantic incompatibility. I believe R doesn't support views like NumPy does, so the things you have to do to see masking semantics aren't even possible in R. I really don't see what's wrong with some more explicit API like > a.mask[3] = True. "Explicit is better than implicit". > I agree, but initial feedback was that the way R deals with NA values is very nice, and I've come to agree that it's worth emulating. -Mark > > Dag Sverre > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:48:33 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:48:33 -0500 Subject: [Numpy-discussion] Missing/accumulating data In-Reply-To: References: Message-ID: Yeah, it takes a long time to wade through and respond to everything. I think the "missing data" problem and weighted masking are closely related, but neither one is fully a subset of the other. With a non-boolean alpha mask, there's an implication of a multiplication operator in there somewhere, but with a boolean mask, the data can be any data whatsoever that doesn't necessarily support any kind of blending operations. 
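(A minimal plain-NumPy sketch of the weighted-stack averaging being discussed, with purely illustrative array names: a float weight acts multiplicatively, and a boolean good/bad mask is just the 0/1 special case.)

import numpy as np

stack = np.random.rand(5, 4, 4)        # (n_frames, ny, nx) image stack
weights = np.random.rand(5, 4, 4)      # per-pixel quality weights
weights[stack > 0.9] = 0.0             # zero weight marks unusable pixels

accum = (stack * weights).sum(axis=0)  # weighted accumulator image
wsum = weights.sum(axis=0)             # accumulated weight per pixel

with np.errstate(invalid='ignore', divide='ignore'):
    mean = accum / wsum                # weighted mean; nan where wsum == 0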
For the image accumulation you're describing, I would use either a structured array with 'color' and 'weight' fields, or have the last element of the color channel be the weight (like an RGBA image) so adding multiple weighted images together would add both the colors and the weights simultaneously, without requiring a ufunc extension supporting struct dtypes. Cheers, Mark On Tue, Jun 28, 2011 at 11:50 AM, Joe Harrington wrote: > As with Travis, I have not had time to wade through the 150+ messages > on masked arrays, but I'd like to raise a concept I've mentioned in > the past that would enable a broader use if done slightly differently. > That is, the "masked array" problem is a subset of this more-general > problem. Please respond on this thread if you want me to see the > replies. > > The problem is of data accumulation and averaging, and it comes up all > the time in astronomy, medical imaging, and other fields that handle > stacks of images. Say that I observe a planet, taking images at, say, > 30 Hz for several hours, as with many video cameras. I want to > register or map those images to a common coordinate system, stack > them, and average them. I must recognize: > > 1. transient there are bad pixels due to cosmic ray hits on the CCD array > 2. some pixels are permanently dead and always produce bad data > 3. some data in the image is not part of the planet > 4. some images have too much atmospheric blurring to be used > 5. sometimes moons of planets like Jupiter pass in front > 6. etc. > > For each original image: > Make a mask where 1 is good and 0 is bad. This trims both bad > pixels and pixels that are not part of the planet. > Evaluate the images and zero the entire image (or regions of an > image) if quality is low (or there is an obstruction). > Shift (or map, in the case of lon-lat mapping) the image AND THE > MASK to a common coordinate grid. > Append the image and the mask to the ends of the stacks of prior > images and masks. > > For each x and y position in the image/mask stacks: > Perform some averaging of all pixels in the stack that are not masked, > or otherwise use the mask info to decide how to use the image info. > > The most common of these averages is to multiply the stack of images > by the stack of masks, zeroing the bad data, then sum both images and > masks down the third axis. This produces an accumulator image and a > count image, where the count image says how many data points went into > the accumulator at each pixel position. Divide the accumulator by the > count. This produces an image with no missing data and the highest > SNR per pixel possible. > > Two things differ from the current ma package: the sense of TRUE/FALSE > and the ability to have numbers other than 0 or 1. For example, if > data quality were part of the assessment, each pixel could get a data > quality number and the accumulation would only include pixels with > better than a certain number, or there could be weighted averaging. > > Currently, we just implement this as described, without using the > masked array capability at all. However, it's cumbersome to hand > around pairs of objects, and making a new object to handle them > together means breaking it apart when calling standard routines that > don't handle masks. There are probably some very fast ways to do > interesting kinds of averages that we haven't explored but would if > this kind of capability were in the infrastructure. 
> > So, I hope that if some kind of masking goes into the array object, it > allows TRUE=good, FALSE=bad, and at least an integer count if not a > floating quality number. One bit is not enough. Of course, these > could be options. Certainly adding 7 or 31 more bits to the mask will > make space problems for some, as would flipping the sense of the > mask. However, speed matters. Some of those doing this kind of > processing are handling hours of staring video data, which is 1e6 > frames a night from each camera. > > --jh-- > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Wed Jun 29 13:53:09 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Wed, 29 Jun 2011 12:53:09 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <87vcvqt9s0.fsf@ginnungagap.bsc.es> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <87vcvqt9s0.fsf@ginnungagap.bsc.es> Message-ID: On Tue, Jun 28, 2011 at 7:34 AM, Llu?s wrote: > Mark Wiebe writes: > > The design that's forming is a combination of: > > > * Solve the missing data problem > > * My ideas of what a good solution looks like: > > * applies to all NumPy dtypes in a fully general way > > * high-performance, low overhead where possible > > * makes the C-level implementation of NumPy nicer to work with, not > harder > > * easy to use from Python for unskilled programmers > > * easy to use more powerful functionality from Python for skilled > programmers > > * satisfies all or most of the needs of the many users of arrays with > a "missing data" aspect to them > > I would add here an efficient mechanism to reinterpret exising data with > different missing information (no copies of the backing array). > > Although I'm not sure whether this requires first-class citizenship or > not. > I'm calling this idea "masking semantics" generally. > * All the feedback I'm getting from discussions on the list > [...] > > I've updated a section "Parameterized Data Type With NA Signal Values" > > in the NEP with an idea for now an NA bit pattern approach could > > coexist and work together with the mask-based approach. I think I've > > solved some of the generality and implementation obstacles, it would > > be great to get some feedback on that. > > Some (obvious) thoughts about it: > > * Trivial to store, as the missing property is encoded in the value > itself. > * Third-party (non-Python) code needs some interface to interpret these > without having to know the implementation details (although the > interface is rather trivial). > * Data marked as missing loses its original value. > * Reinterpreting the same data (memory buffer) with different missing > information requires either memory copies or separate mask arrays (see > above) > > So, while it (data types with NA signal values) has its advantages on a > simpler interaction with 3rd party code and during long-term storage, > masks will still be needed. > > I think that deciding on the value of NA signal values boils down to > this question: should 3rd party code be able to interpret missing data > information stored in the separate mask array? > I'm tossing around some variations of ideas using the iterator to provide a buffered mask-based interface that works uniformly with both masked arrays and NA dtypes. 
This way 3rd party C code only needs to implement one missing data mechanism to fully support both of NumPy's missing data mechanisms. -Mark > If the answer is no, then 3rd party code should be given a copy of the > data where the masked array is merged with the ndarray data buffer > (assuming the original ndarray had a masked array before passing it to > the 3rd party code). As by definition (?) the ndarray with a mask must > retain the original data, the result of the 3rd party code must be > translated back into an ndarray + mask. > > If the answer is yes, then I think the NA signal values just add > unnecessary complexity, as the 3rd party code will already need to use > some numpy-specific API to handle missing data through the ndarray > buffer + mask buffer. This reminds me that if 3rd party were to use the > new iterator interface, the interface could be twisted in a way that it > returns only the non-missing parts. For the sake of performance, this > could be optional, so that the default behaviour is to just iterate > through non-missing data but an option can be used to iterate over all > data, and leave missing data handling up to the 3rd party code. > > > My 2 cents, > Lluis > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 29 14:04:18 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 29 Jun 2011 12:04:18 -0600 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <87vcvqt9s0.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 11:53 AM, Mark Wiebe wrote: > On Tue, Jun 28, 2011 at 7:34 AM, Llu?s wrote: > >> Mark Wiebe writes: >> > The design that's forming is a combination of: >> >> > * Solve the missing data problem >> > * My ideas of what a good solution looks like: >> > * applies to all NumPy dtypes in a fully general way >> > * high-performance, low overhead where possible >> > * makes the C-level implementation of NumPy nicer to work with, not >> harder >> > * easy to use from Python for unskilled programmers >> > * easy to use more powerful functionality from Python for skilled >> programmers >> > * satisfies all or most of the needs of the many users of arrays with >> a "missing data" aspect to them >> >> I would add here an efficient mechanism to reinterpret exising data with >> different missing information (no copies of the backing array). >> >> Although I'm not sure whether this requires first-class citizenship or >> not. >> > > I'm calling this idea "masking semantics" generally. > > > * All the feedback I'm getting from discussions on the list >> [...] >> > I've updated a section "Parameterized Data Type With NA Signal Values" >> > in the NEP with an idea for now an NA bit pattern approach could >> > coexist and work together with the mask-based approach. I think I've >> > solved some of the generality and implementation obstacles, it would >> > be great to get some feedback on that. 
>> >> Some (obvious) thoughts about it: >> >> * Trivial to store, as the missing property is encoded in the value >> itself. >> * Third-party (non-Python) code needs some interface to interpret these >> without having to know the implementation details (although the >> interface is rather trivial). >> * Data marked as missing loses its original value. >> * Reinterpreting the same data (memory buffer) with different missing >> information requires either memory copies or separate mask arrays (see >> above) >> >> So, while it (data types with NA signal values) has its advantages on a >> simpler interaction with 3rd party code and during long-term storage, >> masks will still be needed. >> >> I think that deciding on the value of NA signal values boils down to >> this question: should 3rd party code be able to interpret missing data >> information stored in the separate mask array? >> > > I'm tossing around some variations of ideas using the iterator to provide a > buffered mask-based interface that works uniformly with both masked arrays > and NA dtypes. This way 3rd party C code only needs to implement one missing > data mechanism to fully support both of NumPy's missing data mechanisms. > > ;) Also, it avoids a horrible mass of code. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.s.seljebotn at astro.uio.no Wed Jun 29 14:07:45 2011 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Wed, 29 Jun 2011 20:07:45 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0B3822.5010908@astro.uio.no> Message-ID: <4E0B69F1.9000805@astro.uio.no> On 06/29/2011 07:38 PM, Mark Wiebe wrote: > On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn > > wrote: > > On 06/29/2011 03:45 PM, Matthew Brett wrote: > > Hi, > > > > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe > wrote: > >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew > Brett> > >> wrote: > >>> > >>> Hi, > >>> > >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith > wrote: > >>> ... > >>>> (You might think, what difference does it make if you *can* > unmask an > >>>> item? Us missing data folks could just ignore this feature. But: > >>>> whatever we end up implementing is something that I will have to > >>>> explain over and over to different people, most of them not > >>>> particularly sophisticated programmers. And there's just no > sensible > >>>> way to explain this idea that if you store some particular > value, then > >>>> it replaces the old value, but if you store NA, then the old > value is > >>>> still there. > >>> > >>> Ouch - yes. No question, that is difficult to explain. Well, I > >>> think the explanation might go like this: > >>> > >>> "Ah, yes, well, that's because in fact numpy records missing > values by > >>> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, > >>> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > >>> > >>> Is that fair? > >> > >> My favorite way of explaining it would be to have a grid of > numbers written > >> on paper, then have several cardboards with holes poked in them > in different > >> configurations. Placing these cardboard masks in front of the > grid would > >> show different sets of non-missing data, without affecting the > values stored > >> on the paper behind them. 
> > > > Right - but here of course you are trying to explain the mask, and > > this is Nathaniel's point, that in order to explain NAs, you have to > > explain masks, and so, even at a basic level, the fusion of the two > > ideas is obvious, and already confusing. I mean this: > > > > a[3] = np.NA > > > > "Oh, so you just set the a[3] value to have some missing value code?" > > > > "Ah - no - in fact what I did was set a associated mask in position > > a[3] so that you can't any longer see the previous value of a[3]" > > > > "Huh. You mean I have a mask for every single value in order to be > > able to blank out a[3]? It looks like an assignment. I mean, it > > looks just like a[3] = 4. But I guess it isn't?" > > > > "Er..." > > > > I think Nathaniel's point is a very good one - these are separate > > ideas, np.NA and np.IGNORE, and a joint implementation is bound to > > draw them together in the mind of the user. Apart from anything > > else, the user has to know that, if they want a single NA value in an > > array, they have to add a mask size array.shape in bytes. They have > > to know then, that NA is implemented by masking, and then the 'NA for > > free by adding masking' idea breaks down and starts to feel like a > > kludge. > > > > The counter argument is of course that, in time, the > implementation of > > NA with masking will seem as obvious and intuitive, as, say, > > broadcasting, and that we are just reacting from lack of experience > > with the new API. > > However, no matter how used we get to this, people coming from almost > any other tool (in particular R) will keep think it is > counter-intuitive. Why set up a major semantic incompatability that > people then have to overcome in order to start using NumPy. > > > I'm not aware of a semantic incompatibility. I believe R doesn't support > views like NumPy does, so the things you have to do to see masking > semantics aren't even possible in R. Well, whether the same feature is possible or not in R is irrelevant to whether a semantic incompatability would exist. Views themselves are a *major* semantic incompatability, and are highly confusing at first to MATLAB/Fortran/R people. However they have major advantages outweighing the disadvantage of having to caution new users. But there's simply no precedence anywhere for an assignment that doesn't erase the old value for a particular input value, and the advantages seem pretty minor (well, I think it is ugly in its own right, but that is besides the point...) Dag Sverre From xscript at gmx.net Wed Jun 29 14:20:57 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Wed, 29 Jun 2011 20:20:57 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: (Mark Wiebe's message of "Wed, 29 Jun 2011 12:22:59 -0500") References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: <877h84a48m.fsf@ginnungagap.bsc.es> Mark Wiebe writes: > There seems to be a general idea that masks and NA bit patterns imply > particular differing semantics, something which I think is simply > false. Well, my example contained a difference (the need for the "skipna=True" argument) precisely because it seemed that there was some need for different defaults. Honestly, I think this difference breaks the POLA (principle of least astonishment). [...] > As far as I can tell, the only required difference between them is > that NA bit patterns must destroy the data. Nothing else. Everything > on top of that is a choice of API and interface mechanisms. 
I want > them to behave exactly the same except for that necessary difference, > so that it will be possible to use the *exact same Python code* with > either approach. I completely agree. What I'd suggest is a global and/or per-object "ndarray.flags.skipna" for people like me that just want to ignore these entries without caring about setting it on each operaion (or the other way around, depends on the default behaviour). The downside is that it adds yet another tweaking knob, which is not desirable... Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From bsouthey at gmail.com Wed Jun 29 14:49:11 2011 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 29 Jun 2011 13:49:11 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0B69F1.9000805@astro.uio.no> References: <4E0B3822.5010908@astro.uio.no> <4E0B69F1.9000805@astro.uio.no> Message-ID: <4E0B73A7.4030003@gmail.com> On 06/29/2011 01:07 PM, Dag Sverre Seljebotn wrote: > On 06/29/2011 07:38 PM, Mark Wiebe wrote: >> On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn >> > wrote: >> >> On 06/29/2011 03:45 PM, Matthew Brett wrote: >> > Hi, >> > >> > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe> > wrote: >> >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew >> Brett> >> >> wrote: >> >>> >> >>> Hi, >> >>> >> >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith> > wrote: >> >>> ... >> >>>> (You might think, what difference does it make if you *can* >> unmask an >> >>>> item? Us missing data folks could just ignore this feature. But: >> >>>> whatever we end up implementing is something that I will have to >> >>>> explain over and over to different people, most of them not >> >>>> particularly sophisticated programmers. And there's just no >> sensible >> >>>> way to explain this idea that if you store some particular >> value, then >> >>>> it replaces the old value, but if you store NA, then the old >> value is >> >>>> still there. >> >>> >> >>> Ouch - yes. No question, that is difficult to explain. Well, I >> >>> think the explanation might go like this: >> >>> >> >>> "Ah, yes, well, that's because in fact numpy records missing >> values by >> >>> using a 'mask'. So when you say `a[3] = np.NA', what you mean is, >> >>> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" >> >>> >> >>> Is that fair? >> >> >> >> My favorite way of explaining it would be to have a grid of >> numbers written >> >> on paper, then have several cardboards with holes poked in them >> in different >> >> configurations. Placing these cardboard masks in front of the >> grid would >> >> show different sets of non-missing data, without affecting the >> values stored >> >> on the paper behind them. >> > >> > Right - but here of course you are trying to explain the mask, and >> > this is Nathaniel's point, that in order to explain NAs, you have to >> > explain masks, and so, even at a basic level, the fusion of the two >> > ideas is obvious, and already confusing. I mean this: >> > >> > a[3] = np.NA >> > >> > "Oh, so you just set the a[3] value to have some missing value code?" >> > >> > "Ah - no - in fact what I did was set a associated mask in position >> > a[3] so that you can't any longer see the previous value of a[3]" >> > >> > "Huh. You mean I have a mask for every single value in order to be >> > able to blank out a[3]? It looks like an assignment. 
I mean, it >> > looks just like a[3] = 4. But I guess it isn't?" >> > >> > "Er..." >> > >> > I think Nathaniel's point is a very good one - these are separate >> > ideas, np.NA and np.IGNORE, and a joint implementation is bound to >> > draw them together in the mind of the user. Apart from anything >> > else, the user has to know that, if they want a single NA value in an >> > array, they have to add a mask size array.shape in bytes. They have >> > to know then, that NA is implemented by masking, and then the 'NA for >> > free by adding masking' idea breaks down and starts to feel like a >> > kludge. >> > >> > The counter argument is of course that, in time, the >> implementation of >> > NA with masking will seem as obvious and intuitive, as, say, >> > broadcasting, and that we are just reacting from lack of experience >> > with the new API. >> >> However, no matter how used we get to this, people coming from almost >> any other tool (in particular R) will keep think it is >> counter-intuitive. Why set up a major semantic incompatability that >> people then have to overcome in order to start using NumPy. >> >> >> I'm not aware of a semantic incompatibility. I believe R doesn't support >> views like NumPy does, so the things you have to do to see masking >> semantics aren't even possible in R. > Well, whether the same feature is possible or not in R is irrelevant to > whether a semantic incompatability would exist. > > Views themselves are a *major* semantic incompatability, and are highly > confusing at first to MATLAB/Fortran/R people. However they have major > advantages outweighing the disadvantage of having to caution new users. > > But there's simply no precedence anywhere for an assignment that doesn't > erase the old value for a particular input value, and the advantages > seem pretty minor (well, I think it is ugly in its own right, but that > is besides the point...) > > Dag Sverre > _______________________________________________ Depending on what you really mean by 'precedence', in most stats software (R, SAS, etc.) it is completely up to the user to do this and do it correctly. Usually you store the original data and create new working data as needed as either the equivalent of a masked array or a view. Quite often I have need to have the same variable as a float and a string to be able to have the bets of both worlds (otherwise it is a pain of not only going back and forth but ensuring the right type is being used at the right time). The really really huge advantages of a masked arrays is that you only need the original data plus the mask and it is so easy to find the 'flagged' observations. There is actually no reason why you can not create a masked array or views in R - it is a language after all with the necessary features. So I do not see any semantic issues rather that it has not been implemented because there is insufficient interest. A likely reason is R's very strong statistical heritage so it has been up to the user to make their data fit the existing statistical routines. Really all this discussion just further highlights the value of masked arrays as it really extends the usage of some missing value coding. At present the cost is memory and the performance is a wait and see as there may not be a difference because in both cases the code has to know if a element is masked or missing and then how to deal with it. 
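(As a concrete illustration of that last point, this is what the existing numpy.ma already gives; the flag code -999 is made up for the example.)

import numpy as np

data = np.array([1.2, -999.0, 3.4, -999.0, 5.6])
m = np.ma.masked_values(data, -999.0)

m.compressed()        # array([ 1.2,  3.4,  5.6]): only the good values
np.where(m.mask)[0]   # array([1, 3]): indices of the flagged observations
m.data                # the original data, flag codes and all, untouched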
Bruce From xscript at gmx.net Wed Jun 29 14:51:36 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Wed, 29 Jun 2011 20:51:36 +0200 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: (Mark Wiebe's message of "Wed, 29 Jun 2011 12:53:09 -0500") References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <87vcvqt9s0.fsf@ginnungagap.bsc.es> Message-ID: <87ei2c8o93.fsf@ginnungagap.bsc.es> Mark Wiebe writes: [...] > I think that deciding on the value of NA signal values boils down to > this question: should 3rd party code be able to interpret missing data > information stored in the separate mask array? > I'm tossing around some variations of ideas using the iterator to > provide a buffered mask-based interface that works uniformly with both > masked arrays and NA dtypes. This way 3rd party C code only needs to > implement one missing data mechanism to fully support both of NumPy's > missing data mechanisms. Nice. If non-numpy C code is bound to see it as an array (i.e., _always_ oblivious to the mask concept), then you should probably do what I said about "(un)merging" the bit pattern and mask-based NAs, but in this case can be done on each block given by the iteration window. There's still the possibility of giving a finer granularity interface where both are explicitly accessed, but this will probably add yet another set of API functions (although the merging interface can be implemented on top of this explicit raw iteration interface). BTW, this has some overlapping with a mail Travis sent long ago about dynamically filling the backing byffer contents (in this case with the "merged" NA data for 3rd parties). It might prove completely unsatisfactory (w.r.t. performance), but you could also fake a bit-pattern-only sequential array by using mprotect to detect the memory accesses and trigger then the production of the merged data. This provides means for code using the simple buffer protocol, without duplicating the whole structure for NA merges. This can be complicated even more with some simple strided pattern detection to diminish the number of segfaults, as the shape is known. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From matthew.brett at gmail.com Wed Jun 29 15:32:14 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 29 Jun 2011 20:32:14 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe wrote: > On Wed, Jun 29, 2011 at 8:20 AM, Llu?s wrote: >> >> Matthew Brett writes: >> >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys >> >> the idea that the entry is still there, but we're just ignoring it. ?Of >> >> course, that goes against common convention, but it might be easier to >> >> explain. >> >> > I think Nathaniel's point is that np.IGNORE is a different idea than >> > np.NA, and that is why joining the implementations can lead to >> > conceptual confusion. 
>> >> This is how I see it: >> >> >>> a = np.array([0, 1, 2], dtype=int) >> >>> a[0] = np.NA >> ValueError >> >>> e = np.array([np.NA, 1, 2], dtype=int) >> ValueError >> >>> b ?= np.array([np.NA, 1, 2], dtype=np.maybe(int)) >> >>> m ?= np.array([np.NA, 1, 2], dtype=int, masked=True) >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) >> >>> b[1] = np.NA >> >>> np.sum(b) >> np.NA >> >>> np.sum(b, skipna=True) >> 2 >> >>> b.mask >> None >> >>> m[1] = np.NA >> >>> np.sum(m) >> 2 >> >>> np.sum(m, skipna=True) >> 2 >> >>> m.mask >> [False, False, True] >> >>> bm[1] = np.NA >> >>> np.sum(bm) >> 2 >> >>> np.sum(bm, skipna=True) >> 2 >> >>> bm.mask >> [False, False, True] >> >> So: >> >> * Mask takes precedence over bit pattern on element assignment. There's >> ?still the question of how to assign a bit pattern NA when the mask is >> ?active. >> >> * When using mask, elements are automagically skipped. >> >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False" >> >> * When using bit pattern + mask, it might make sense to have the initial >> ?values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, >> ?False, True]" and "np.sum(bm) == np.NA") > > There seems to be a general idea that masks and NA bit patterns imply > particular differing semantics, something which I think is simply false. Well - first - it's helpful surely to separate the concepts and the implementation. Concepts / use patterns (as delineated by Nathaniel): A) missing values == 'np.NA' in my emails. Can we call that CMV (concept missing values)? B) masks == np.IGNORE in my emails . CMSK (concept masks)? Implementations 1) bit-pattern == na-dtype - how about we call that IBP (implementation bit patten)? 2) array.mask. IM (implementation mask)? Nathaniel implied that: CMV implies: sum([np.NA, 1]) == np.NA CMSK implies sum([np.NA, 1]) == 1 and indeed, that's how R and masked arrays respectively behave. So I think it's reasonable to say that at least R thought that the bitmask implied the first and Pierre and others thought the mask meant the second. The NEP as it stands thinks of CMV and and CM as being different views of the same thing, Please correct me if I'm wrong. > Both NaN and Inf are implemented in hardware with the same idea as the NA > bit pattern, but they do not follow NA missing value semantics. Right - and that doesn't affect the argument, because the argument is about the concepts and not the implementation. > As far as I can tell, the only required difference between them is that NA > bit patterns must destroy the data. Nothing else. I think Nathaniel's point was about the expected default behavior in the different concepts. > Everything on top of that > is a choice of API and interface mechanisms. I want them to behave exactly > the same except for that necessary difference, so that it will be possible > to use the *exact same Python code* with either approach. Right. And Nathaniel's point is that that desire leads to fusion of the two ideas into one when they should be separated. For example, if I understand correctly: >>> a = np.array([1.0, 2.0, 3, 7.0], masked=True) >>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') >>> a[3] = np.NA # actual real hand-on-heart assignment >>> b[3] = np.NA # magic mask setting although it looks the same > Say you're using NA dtypes, and suddenly you think, "what if I temporarily > treated these as NA too". Now you have to copy your whole array to avoid > destroying your data! The NA bit pattern didn't save you memory here... 
Say > you're using masks, and it turns out you didn't actually need masking > semantics. If they're different, you now have to do lots of code changes to > switch to NA dtypes! I personally have not run across that case. I'd imagine that, if you knew you wanted to do something so explicitly masking-like, you'd start with the masking interface. Clearly there are some overlaps between what masked arrays are trying to achieve and what Rs NA mechanisms are trying to achieve. Are they really similar enough that they should function using the same API? And if so, won't that be confusing? I think that's the question that's being asked. See you, Matthew From matthew.brett at gmail.com Wed Jun 29 15:35:35 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 29 Jun 2011 20:35:35 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: Oops, On Wed, Jun 29, 2011 at 8:32 PM, Matthew Brett wrote: > Hi, > > On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe wrote: >> On Wed, Jun 29, 2011 at 8:20 AM, Llu?s wrote: >>> >>> Matthew Brett writes: >>> >>> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys >>> >> the idea that the entry is still there, but we're just ignoring it. ?Of >>> >> course, that goes against common convention, but it might be easier to >>> >> explain. >>> >>> > I think Nathaniel's point is that np.IGNORE is a different idea than >>> > np.NA, and that is why joining the implementations can lead to >>> > conceptual confusion. >>> >>> This is how I see it: >>> >>> >>> a = np.array([0, 1, 2], dtype=int) >>> >>> a[0] = np.NA >>> ValueError >>> >>> e = np.array([np.NA, 1, 2], dtype=int) >>> ValueError >>> >>> b ?= np.array([np.NA, 1, 2], dtype=np.maybe(int)) >>> >>> m ?= np.array([np.NA, 1, 2], dtype=int, masked=True) >>> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) >>> >>> b[1] = np.NA >>> >>> np.sum(b) >>> np.NA >>> >>> np.sum(b, skipna=True) >>> 2 >>> >>> b.mask >>> None >>> >>> m[1] = np.NA >>> >>> np.sum(m) >>> 2 >>> >>> np.sum(m, skipna=True) >>> 2 >>> >>> m.mask >>> [False, False, True] >>> >>> bm[1] = np.NA >>> >>> np.sum(bm) >>> 2 >>> >>> np.sum(bm, skipna=True) >>> 2 >>> >>> bm.mask >>> [False, False, True] >>> >>> So: >>> >>> * Mask takes precedence over bit pattern on element assignment. There's >>> ?still the question of how to assign a bit pattern NA when the mask is >>> ?active. >>> >>> * When using mask, elements are automagically skipped. >>> >>> * "m[1] = np.NA" is equivalent to "m.mask[1] = False" >>> >>> * When using bit pattern + mask, it might make sense to have the initial >>> ?values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, >>> ?False, True]" and "np.sum(bm) == np.NA") >> >> There seems to be a general idea that masks and NA bit patterns imply >> particular differing semantics, something which I think is simply false. > > Well - first - it's helpful surely to separate the concepts and the > implementation. > > Concepts / use patterns (as delineated by Nathaniel): > A) missing values == 'np.NA' in my emails. ?Can we call that CMV > (concept missing values)? > B) masks == np.IGNORE in my emails . CMSK (concept masks)? > > Implementations > 1) bit-pattern == na-dtype - how about we call that IBP > (implementation bit patten)? > 2) array.mask. ?IM (implementation mask)? 
> > Nathaniel implied that: > > CMV implies: sum([np.NA, 1]) == np.NA > CMSK implies sum([np.NA, 1]) == 1 > > and indeed, that's how R and masked arrays respectively behave. ?So I > think it's reasonable to say that at least R thought that the bitmask > implied the first and Pierre and others thought the mask meant the > second. > > The NEP as it stands thinks of CMV and and CM as being different views > of the same thing, ? Please correct me if I'm wrong. > >> Both NaN and Inf are implemented in hardware with the same idea as the NA >> bit pattern, but they do not follow NA missing value semantics. > > Right - and that doesn't affect the argument, because the argument is > about the concepts and not the implementation. > >> As far as I can tell, the only required difference between them is that NA >> bit patterns must destroy the data. Nothing else. > > I think Nathaniel's point was about the expected default behavior in > the different concepts. > >> Everything on top of that >> is a choice of API and interface mechanisms. I want them to behave exactly >> the same except for that necessary difference, so that it will be possible >> to use the *exact same Python code* with either approach. > > Right. ?And Nathaniel's point is that that desire leads to fusion of > the two ideas into one when they should be separated. ?For example, if > I understand correctly: > >>>> a = np.array([1.0, 2.0, 3, 7.0], masked=True) >>>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') >>>> a[3] = np.NA ?# actual real hand-on-heart assignment >>>> b[3] = np.NA # magic mask setting although it looks the same I meant: >>> a = np.array([1.0, 2.0, 3.0, 7.0], masked=True) >>> b = np.array([1.0, 2.0, 3.0, 7.0], dtype='NA[f8]') >>> b[3] = np.NA ?# actual real hand-on-heart assignment >>> a[3] = np.NA # magic mask setting although it looks the same Sorry, Matthew From matthew.brett at gmail.com Wed Jun 29 15:47:01 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 29 Jun 2011 20:47:01 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <877h84a48m.fsf@ginnungagap.bsc.es> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Wed, Jun 29, 2011 at 7:20 PM, Llu?s wrote: > Mark Wiebe writes: > >> There seems to be a general idea that masks and NA bit patterns imply >> particular differing semantics, something which I think is simply >> false. > > Well, my example contained a difference (the need for the "skipna=True" > argument) precisely because it seemed that there was some need for > different defaults. > > Honestly, I think this difference breaks the POLA (principle of least > astonishment). > > > [...] >> As far as I can tell, the only required difference between them is >> that NA bit patterns must destroy the data. Nothing else. Everything >> on top of that is a choice of API and interface mechanisms. I want >> them to behave exactly the same except for that necessary difference, >> so that it will be possible to use the *exact same Python code* with >> either approach. > > I completely agree. What I'd suggest is a global and/or per-object > "ndarray.flags.skipna" for people like me that just want to ignore these > entries without caring about setting it on each operaion (or the other > way around, depends on the default behaviour). > > The downside is that it adds yet another tweaking knob, which is not > desirable... 
Oh - dear - that would be horrible, if, depending on the tweak somewhere in the distant past of your script, this: >>> a = np.array([np.NA, 1.0], masked=True) >>> np.sum(a) could return either np.NA or 1.0... Imagine someone twiddled the knob the other way and ran your script... See you, Matthew From charlesr.harris at gmail.com Wed Jun 29 16:17:58 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 29 Jun 2011 14:17:58 -0600 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett wrote: > Hi, > > On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe wrote: > > On Wed, Jun 29, 2011 at 8:20 AM, Llu?s wrote: > >> > >> Matthew Brett writes: > >> > >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys > >> >> the idea that the entry is still there, but we're just ignoring it. > Of > >> >> course, that goes against common convention, but it might be easier > to > >> >> explain. > >> > >> > I think Nathaniel's point is that np.IGNORE is a different idea than > >> > np.NA, and that is why joining the implementations can lead to > >> > conceptual confusion. > >> > >> This is how I see it: > >> > >> >>> a = np.array([0, 1, 2], dtype=int) > >> >>> a[0] = np.NA > >> ValueError > >> >>> e = np.array([np.NA, 1, 2], dtype=int) > >> ValueError > >> >>> b = np.array([np.NA, 1, 2], dtype=np.maybe(int)) > >> >>> m = np.array([np.NA, 1, 2], dtype=int, masked=True) > >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) > >> >>> b[1] = np.NA > >> >>> np.sum(b) > >> np.NA > >> >>> np.sum(b, skipna=True) > >> 2 > >> >>> b.mask > >> None > >> >>> m[1] = np.NA > >> >>> np.sum(m) > >> 2 > >> >>> np.sum(m, skipna=True) > >> 2 > >> >>> m.mask > >> [False, False, True] > >> >>> bm[1] = np.NA > >> >>> np.sum(bm) > >> 2 > >> >>> np.sum(bm, skipna=True) > >> 2 > >> >>> bm.mask > >> [False, False, True] > >> > >> So: > >> > >> * Mask takes precedence over bit pattern on element assignment. There's > >> still the question of how to assign a bit pattern NA when the mask is > >> active. > >> > >> * When using mask, elements are automagically skipped. > >> > >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False" > >> > >> * When using bit pattern + mask, it might make sense to have the initial > >> values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, > >> False, True]" and "np.sum(bm) == np.NA") > > > > There seems to be a general idea that masks and NA bit patterns imply > > particular differing semantics, something which I think is simply false. > > Well - first - it's helpful surely to separate the concepts and the > implementation. > > Concepts / use patterns (as delineated by Nathaniel): > A) missing values == 'np.NA' in my emails. Can we call that CMV > (concept missing values)? > B) masks == np.IGNORE in my emails . CMSK (concept masks)? > > Implementations > 1) bit-pattern == na-dtype - how about we call that IBP > (implementation bit patten)? > 2) array.mask. IM (implementation mask)? > > Remember that the masks are invisible, you can't see them, they are an implementation detail. A good reason to hide the implementation is so it can be changed without impacting software that depends on the API. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From matthew.brett at gmail.com Wed Jun 29 16:25:24 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 29 Jun 2011 21:25:24 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Wed, Jun 29, 2011 at 9:17 PM, Charles R Harris wrote: > > > On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett > wrote: >> >> Hi, >> >> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe wrote: >> > On Wed, Jun 29, 2011 at 8:20 AM, Llu?s wrote: >> >> >> >> Matthew Brett writes: >> >> >> >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of >> >> >> conveys >> >> >> the idea that the entry is still there, but we're just ignoring it. >> >> >> ?Of >> >> >> course, that goes against common convention, but it might be easier >> >> >> to >> >> >> explain. >> >> >> >> > I think Nathaniel's point is that np.IGNORE is a different idea than >> >> > np.NA, and that is why joining the implementations can lead to >> >> > conceptual confusion. >> >> >> >> This is how I see it: >> >> >> >> >>> a = np.array([0, 1, 2], dtype=int) >> >> >>> a[0] = np.NA >> >> ValueError >> >> >>> e = np.array([np.NA, 1, 2], dtype=int) >> >> ValueError >> >> >>> b ?= np.array([np.NA, 1, 2], dtype=np.maybe(int)) >> >> >>> m ?= np.array([np.NA, 1, 2], dtype=int, masked=True) >> >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) >> >> >>> b[1] = np.NA >> >> >>> np.sum(b) >> >> np.NA >> >> >>> np.sum(b, skipna=True) >> >> 2 >> >> >>> b.mask >> >> None >> >> >>> m[1] = np.NA >> >> >>> np.sum(m) >> >> 2 >> >> >>> np.sum(m, skipna=True) >> >> 2 >> >> >>> m.mask >> >> [False, False, True] >> >> >>> bm[1] = np.NA >> >> >>> np.sum(bm) >> >> 2 >> >> >>> np.sum(bm, skipna=True) >> >> 2 >> >> >>> bm.mask >> >> [False, False, True] >> >> >> >> So: >> >> >> >> * Mask takes precedence over bit pattern on element assignment. There's >> >> ?still the question of how to assign a bit pattern NA when the mask is >> >> ?active. >> >> >> >> * When using mask, elements are automagically skipped. >> >> >> >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False" >> >> >> >> * When using bit pattern + mask, it might make sense to have the >> >> initial >> >> ?values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, >> >> ?False, True]" and "np.sum(bm) == np.NA") >> > >> > There seems to be a general idea that masks and NA bit patterns imply >> > particular differing semantics, something which I think is simply false. >> >> Well - first - it's helpful surely to separate the concepts and the >> implementation. >> >> Concepts / use patterns (as delineated by Nathaniel): >> A) missing values == 'np.NA' in my emails. ?Can we call that CMV >> (concept missing values)? >> B) masks == np.IGNORE in my emails . CMSK (concept masks)? >> >> Implementations >> 1) bit-pattern == na-dtype - how about we call that IBP >> (implementation bit patten)? >> 2) array.mask. ?IM (implementation mask)? >> > > Remember that the masks are invisible, you can't see them, they are an > implementation detail. A good reason to hide the implementation is so it can > be changed without impacting software that depends on the API. It's not true that you can't see them because masks are using the same API as for missing values. 
Because they're using the same API, the person using the CMV stuff will soon find out about the masks, accidentally or not, then they will need to understand masking, and that is the problem we're discussing here. See you, Matthew From njs at pobox.com Wed Jun 29 16:41:24 2011 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 29 Jun 2011 13:41:24 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <877h84a48m.fsf@ginnungagap.bsc.es> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 11:20 AM, Llu?s wrote: > I completely agree. What I'd suggest is a global and/or per-object > "ndarray.flags.skipna" for people like me that just want to ignore these > entries without caring about setting it on each operaion (or the other > way around, depends on the default behaviour). I agree with with Matthew that this approach would end up having horrible side-effects, but I can see why you'd want some way to accomplish this... I suggested another approach to handling both NA-style and mask-style missing data by making them totally separate features. It's buried at the bottom of this over-long message (you can search for "my proposal"): http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057251.html I know that the part 1 of that proposal would satisfy my needs, but I don't know as much about your use case, so I'm curious. Would that proposal (in particular, part 2, the classic masked-array part) work for you? -- Nathaniel From efiring at hawaii.edu Wed Jun 29 17:21:07 2011 From: efiring at hawaii.edu (Eric Firing) Date: Wed, 29 Jun 2011 11:21:07 -1000 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: <4E0B9743.10608@hawaii.edu> On 06/29/2011 09:32 AM, Matthew Brett wrote: > Hi, > [...] > > Clearly there are some overlaps between what masked arrays are trying > to achieve and what Rs NA mechanisms are trying to achieve. Are they > really similar enough that they should function using the same API? > And if so, won't that be confusing? I think that's the question > that's being asked. And I think the answer is "no". No more confusing to people coming from R to numpy than views already are--with or without the NEP--and not *requiring* people to use any NA-related functionality beyond what they are used to from R. My understanding of the NEP is that it directly yields an API closely matching that of R, but with the opportunity, via views, to do more with less work, if one so desires. The present masked array module could be made more efficient if the NEP is implemented; regardless of whether this is done, the masked array module is not about to vanish, so anyone wanting precisely the masked array API will have it; and others remain free to ignore it (except for those of us involved in developing libraries such as matplotlib, which will have to support all variations of the new API along with the already-supported masked arrays). In addition, for new code, the full-blown masked array module may not be needed. A convenience it adds, however, is the automatic masking of invalid values: In [1]: np.ma.log(-1) Out[1]: masked I'm sure this horrifies some, but there are times and places where it is a genuine convenience, and preferable to having to use a separate operation to replace nan or inf with NA or whatever it ends up being. 
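(Spelled out, the convenience is the difference between the one-step and the two-step version below; nothing here goes beyond plain numpy and numpy.ma as they are now, and x is just an example array.)

import numpy as np

x = np.array([-1.0, 0.5, 2.0])

np.ma.log(x)   # masked_array([--, -0.693, 0.693], mask=[True, False, False])

with np.errstate(invalid='ignore'):
    y = np.log(x)            # array([nan, -0.693, 0.693])
np.ma.masked_invalid(y)      # same result, but as a separate clean-up step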
If np.seterr were extended to allow such automatic masking as an option, then the need for a separate masked array module would shrink further. I wouldn't mind having to use an explicit kwarg for ignoring NA in reduction methods. Eric > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From xscript at gmx.net Wed Jun 29 17:40:06 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Wed, 29 Jun 2011 23:40:06 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: (Nathaniel Smith's message of "Wed, 29 Jun 2011 13:41:24 -0700") References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> Message-ID: <87wrg471vt.fsf@ginnungagap.bsc.es> Nathaniel Smith writes: > I know that the part 1 of that proposal would satisfy my needs, but I > don't know as much about your use case, so I'm curious. Would that > proposal (in particular, part 2, the classic masked-array part) work > for you? I'm for the option of having a single API when you want to have NA elements, regardless of whether it's using masks or bit patterns. My question is whether your ufuncs should react differently depending on the type of array you're using (bit pattern vs mask). In the beginning I thought it could make sense, as you know how you have created the array. So if you're using masks, you're probably going to ignore the NAs (becase you've explicitly set them, and you don't want a NA as the result of your summation). *But*, the more API/semantics both approaches share, the better; so I'd say that its better that they show the *very same* behaviour (w.r.t. "skipna"). My concern is now about how to set the "skipna" in a "comfortable" way, so that I don't have to set it again and again as ufunc arguments: >>> a array([NA, 2, 3]) >>> b array([1, 2, NA]) >>> a + b array([NA, 2, NA]) >>> a.flags.skipna=True >>> b.flags.skipna=True >>> a + b array([1, 4, 3]) Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From njs at pobox.com Wed Jun 29 18:42:53 2011 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 29 Jun 2011 15:42:53 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <87wrg471vt.fsf@ginnungagap.bsc.es> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> <87wrg471vt.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 2:40 PM, Llu?s wrote: > I'm for the option of having a single API when you want to have NA > elements, regardless of whether it's using masks or bit patterns. I understand the desire to avoid having two different APIS... [snip] > My concern is now about how to set the "skipna" in a "comfortable" way, > so that I don't have to set it again and again as ufunc arguments: > >>>> a > array([NA, 2, 3]) >>>> b > array([1, 2, NA]) >>>> a + b > array([NA, 2, NA]) >>>> a.flags.skipna=True >>>> b.flags.skipna=True >>>> a + b > array([1, 4, 3]) ...But... now you're introducing two different kinds of arrays with different APIs again? Ones where .skipna==True, and ones where .skipna==False? 
I know that this way it's not keyed on the underlying storage format, but if we support both bit patterns and mask arrays at the implementation level, then the only way to make them have identical APIs is if we completely disallow unmasking, and shared masks, and so forth. Which doesn't seem like it'd be very popular (and would make including the mask-based implementation pretty pointless). So I think we have to assume that they will have APIs that are at least somewhat different. And then it seems like with this proposal then we'd actually end up with *4* different APIs that any particular array might follow... (or maybe more, depending on how arrays that had both a bit-pattern and mask ended up working). That's why I was thinking the best solution might be to just bite the bullet and make the APIs *totally* different and non-overlapping, so it was always obvious which you were using and how they'd interact. But I don't know -- for my work I'd be happy to just pass skipna everywhere I needed it, and never unmask anything, and so forth, so maybe there's some reason why it's really important for the bit-pattern NA API to overlap more with the masked array API? -- Nathaniel From josef.pktd at gmail.com Wed Jun 29 19:57:59 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 29 Jun 2011 19:57:59 -0400 Subject: [Numpy-discussion] Multiply along axis In-Reply-To: <4E0B3F30.2010700@re-factory.de> References: <4E0B376E.6010706@re-factory.de> <4E0B3F30.2010700@re-factory.de> Message-ID: On Wed, Jun 29, 2011 at 11:05 AM, Robert Elsner wrote: > > Yeah great that was spot-on. And I thought I knew most of the slicing > tricks. I combined it with a slice object so that > > idx_obj = [ None for i in xrange(a.ndim) ] or idx_obj = [None] * a.ndim otherwise this is also what I do quite often Josef > idx_obj[axis] = slice(None) > > a * x[idx_object] > > works the way I want it. Suggestions are welcome but I am happy with the > quick solution you pointed out. Thanks > > > On 29.06.2011 16:38, Skipper Seabold wrote: >> On Wed, Jun 29, 2011 at 10:32 AM, Robert Elsner wrote: >>> Hello everyone, >>> >>> I would like to solve the following problem (preferably without >>> reshaping / flipping the array a). >>> >>> Assume I have a vector v of length x and an n-dimensional array a where >>> one dimension has length x as well. Now I would like to multiply the >>> vector v along a given axis of a. >>> >>> Some example code >>> >>> a = np.random.random((2,3)) >>> x = np.zeros(2) >>> >>> a * x ? # Fails because not broadcastable >>> >>> >>> So how do I multiply x with the columns of a so that for each column j >>> a[:,j] = a[:,j] * x >>> >>> without using a loop. Is there some (fast) fast way to accomplish that >>> with numpy/scipy? >>> >> Is this what you want? 
>> >> a * x[:,None] >> >> or >> >> a * ?x[:,np.newaxis] >> >> or more generally >> >> a * np.expand_dims(x,1) >> >> Skipper >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From jsseabold at gmail.com Wed Jun 29 22:35:41 2011 From: jsseabold at gmail.com (Skipper Seabold) Date: Wed, 29 Jun 2011 22:35:41 -0400 Subject: [Numpy-discussion] Two bugs in recfunctions.join_by with patch Message-ID: These two cases failed in recfunctions.join_by 1) the case for having either r1postfix or r2postfix as an empty string was not handled. 2) If there is more than one key and more than variable with a name collision. Patch and tests in a pull request here: https://github.com/numpy/numpy/pull/100 Skipper From Chris.Barker at noaa.gov Thu Jun 30 02:49:03 2011 From: Chris.Barker at noaa.gov (Chris Barker) Date: Wed, 29 Jun 2011 23:49:03 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0C1C5F.5060909@noaa.gov> On 6/27/11 9:53 AM, Charles R Harris wrote: > Some discussion of disk storage might also help. I don't see how the > rules can be enforced if two files are used, one for the mask and > another for the data, but that may just be something we need to live with. It seems it wouldn't be too big deal to extend the *.npy format to include the mask. Could one memmap both the data array and the mask? Netcdf (and assume hdf) have ways to support masks as well. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From SSharma84 at slb.com Thu Jun 30 04:03:09 2011 From: SSharma84 at slb.com (Sachin Kumar Sharma) Date: Thu, 30 Jun 2011 08:03:09 +0000 Subject: [Numpy-discussion] Fitting Log normal or truncated log normal distibution to three points Message-ID: <75C2FED246299A478280FA1470EDA443472697F1@NL0230MBX06N2.DIR.slb.com> Hi, I have three points 10800, 81100, 582000. What is the easiest way of fitting a log normal and truncated log normal distribution to these three points using numpy. I would appreciate your reply for the same. Cheers Sachin ************************************************************************ Sachin Kumar Sharma Senior Geomodeler -------------- next part -------------- An HTML attachment was scrubbed... URL: From Deil.Christoph at googlemail.com Thu Jun 30 05:25:49 2011 From: Deil.Christoph at googlemail.com (Christoph Deil) Date: Thu, 30 Jun 2011 11:25:49 +0200 Subject: [Numpy-discussion] Fitting Log normal or truncated log normal distibution to three points In-Reply-To: <75C2FED246299A478280FA1470EDA443472697F1@NL0230MBX06N2.DIR.slb.com> References: <75C2FED246299A478280FA1470EDA443472697F1@NL0230MBX06N2.DIR.slb.com> Message-ID: On Jun 30, 2011, at 10:03 AM, Sachin Kumar Sharma wrote: > Hi, > > I have three points 10800, 81100, 582000. > > What is the easiest way of fitting a log normal and truncated log normal distribution to these three points using numpy. The lognormal and maximum likelihood fit is available in scipy.stats. 
It's easier to use this than to implement your own fit in numpy, although if you can't use scipy for some reason you can have a look there on how to implement it in numpy. The rest of this reply uses scipy.stats, so if you are only interested in numpy, please stop reading. >>> data = [10800, 81100, 582000] >>> import scipy.stats >>> print scipy.stats.lognorm.extradoc Lognormal distribution lognorm.pdf(x,s) = 1/(s*x*sqrt(2*pi)) * exp(-1/2*(log(x)/s)**2) for x > 0, s > 0. If log x is normally distributed with mean mu and variance sigma**2, then x is log-normally distributed with shape paramter sigma and scale parameter exp(mu). >>> scipy.stats.lognorm.fit(data) /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/optimize/optimize.py:280: RuntimeWarning: invalid value encountered in subtract and max(abs(fsim[0]-fsim[1:])) <= ftol): /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/optimize/optimize.py:280: RuntimeWarning: invalid value encountered in absolute and max(abs(fsim[0]-fsim[1:])) <= ftol): (1.0, 30618.493505379971, 117675.94879488947) >>> scipy.stats.lognorm.fit(data, floc=0, fscale=1) [11.405078125000022, 0, 1] See http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html for further info on how the location and scale parameter are handled, basically the x in the lognormal.pdf formula above is x = (x_actual - loc) / scale. Now how to fit a truncated lognorm? First note that because log(x) can only be computed for x>0, scipy.stats.lognorm is already truncated to x > 0. >>> scipy.stats.lognorm.cdf(0, 1) # arguments: (x, s) 0.0 >>> scipy.stats.lognorm.cdf(1, 1) # arguments: (x, s) 0.5 >>> scipy.stats.lognorm.cdf(2, 1) # arguments: (x, s) 0.75589140421441725 As documented here, you can use parameters a and b to set the support of the distribution (i.e. the lower and upper truncation points): >>> help(scipy.stats.rv_continuous) | a : float, optional | Lower bound of the support of the distribution, default is minus | infinity. | b : float, optional | Upper bound of the support of the distribution, default is plus | infinity. However when I try to use the a, b parameters to call pdf() (as a simpler method than fit() to check if it works) I run into the following problem: >>> scipy.stats.lognorm.pdf(1, 1) 0.3989422804014327 >>> scipy.stats.lognorm(a=1).pdf(1, 1) Traceback (most recent call last): File "", line 1, in TypeError: pdf() takes exactly 2 arguments (3 given) >>> scipy.stats.lognorm(a=1).pdf(1) Traceback (most recent call last): File "", line 1, in File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/distributions.py", line 335, in pdf return self.dist.pdf(x, *self.args, **self.kwds) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/distributions.py", line 1113, in pdf place(output,cond,self._pdf(*goodargs) / scale) TypeError: _pdf() takes exactly 3 arguments (2 given) For a distribution without parameters besides (loc, scale), setting a works: >>> scipy.stats.norm(a=-2).pdf(3) 0.0044318484119380075 Is this a bug or am I simply using it wrong? It would be nice if http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html contained an example or two of how to use the a, b, xa, xb, xtol parameters of scipy.stats.rv_continuous . Christoph -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From josef.pktd at gmail.com Thu Jun 30 08:49:36 2011 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 30 Jun 2011 08:49:36 -0400 Subject: [Numpy-discussion] Fitting Log normal or truncated log normal distibution to three points In-Reply-To: References: <75C2FED246299A478280FA1470EDA443472697F1@NL0230MBX06N2.DIR.slb.com> Message-ID: On Thu, Jun 30, 2011 at 5:25 AM, Christoph Deil wrote: > > On Jun 30, 2011, at 10:03 AM, Sachin Kumar Sharma wrote: > > Hi, > > I have three points 10800, 81100, 582000. > > What is the easiest way of fitting a log normal and truncated log normal > distribution to these three points using numpy. > > The lognormal and maximum likelihood fit is available in scipy.stats.?It's > easier to use this than to implement your own fit in numpy, although if you > can't use scipy for some reason you can have a look there on how to > implement it in numpy. > The rest of this reply uses scipy.stats, so if you are only interested in > numpy, please stop reading. >>>> data = [10800, 81100, 582000] >>>> import scipy.stats >>>> print scipy.stats.lognorm.extradoc > > Lognormal distribution > lognorm.pdf(x,s) = 1/(s*x*sqrt(2*pi)) * exp(-1/2*(log(x)/s)**2) > for x > 0, s > 0. > If log x is normally distributed with mean mu and variance sigma**2, > then x is log-normally distributed with shape paramter sigma and scale > parameter exp(mu). > >>>> scipy.stats.lognorm.fit(data) > /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/optimize/optimize.py:280: > RuntimeWarning: invalid value encountered in subtract > ??and max(abs(fsim[0]-fsim[1:])) <= ftol): > /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/optimize/optimize.py:280: > RuntimeWarning: invalid value encountered in absolute > ??and max(abs(fsim[0]-fsim[1:])) <= ftol): > (1.0, 30618.493505379971, 117675.94879488947) >>>> scipy.stats.lognorm.fit(data, floc=0, fscale=1) > [11.405078125000022, 0, 1] > See?http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html for > further info on how the location and scale parameter are handled, basically > the x in the lognormal.pdf formula above is x = (x_actual - loc) / scale. > Now how to fit a truncated lognorm? > First note that because log(x) can only be computed for x>0, > scipy.stats.lognorm is already truncated to x > 0. >>>> scipy.stats.lognorm.cdf(0, 1) # arguments: (x, s) > 0.0 >>>> scipy.stats.lognorm.cdf(1, 1) # arguments: (x, s) > 0.5 >>>> scipy.stats.lognorm.cdf(2, 1) # arguments: (x, s) > 0.75589140421441725 > As documented here, you can use parameters a and b to set the support of the > distribution (i.e. the lower and upper truncation points): >>>> help(scipy.stats.rv_continuous) > ?| ?a : float, optional > ?| ? ? ?Lower bound of the support of the distribution, default is minus > ?| ? ? ?infinity. > ?| ?b : float, optional > ?| ? ? ?Upper bound of the support of the distribution, default is plus > ?| ? ? ?infinity. 
> However when I try to use the a, b parameters to call pdf() (as a simpler > method than fit() to check if it works) ?I run into the following problem: >>>> scipy.stats.lognorm.pdf(1, 1) > 0.3989422804014327 >>>> scipy.stats.lognorm(a=1).pdf(1, 1) > Traceback (most recent call last): > ??File "", line 1, in > TypeError: pdf() takes exactly 2 arguments (3 given) >>>> scipy.stats.lognorm(a=1).pdf(1) > Traceback (most recent call last): > ??File "", line 1, in > ??File > "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/distributions.py", > line 335, in pdf > ?? ?return self.dist.pdf(x, *self.args, **self.kwds) > ??File > "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/distributions.py", > line 1113, in pdf > ?? ?place(output,cond,self._pdf(*goodargs) / scale) > TypeError: _pdf() takes exactly 3 arguments (2 given) > For a distribution without parameters besides (loc, scale), setting a works: >>>> scipy.stats.norm(a=-2).pdf(3) > 0.0044318484119380075 > Is this a bug or am I simply using it wrong? > It would be nice > if?http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html contained > an example or two of how to use the a, b, xa, xb, xtol parameters of > scipy.stats.rv_continuous . in your last example a=-2 is ignored >>> stats.norm.pdf(3) 0.0044318484119380075 a, b, xa, xb are attributes of the distribution that are set when the distribution instance is created. changing a and b doesn't make much sense since you would truncate the support of the distribution without changing the distribution >>> stats.norm.pdf(-3) 0.0044318484119380075 >>> stats.norm.a = -2 >>> stats.norm.pdf(-3) 0.0 stats.norm and the other distributions is an instances, and if you change a in it it will stay this way, and might mess up other calculations you might want to do. xa, xb are only used internally as limits for the inversion of the cdf (limits to fsolve) and I don't think they have any other effect. what's a truncated lognormal ? left or right side truncated. I think it might need to be created by subclassing. there might be a direct way of estimating lognormal parameters from log(x).mean() and log(x).std() if x is a lognormal sample. (that's a scipy user question) Josef > Christoph > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > From matthew.brett at gmail.com Thu Jun 30 09:31:53 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 14:31:53 +0100 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 Message-ID: Hi, On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > Anyway, it's pretty clear that in this particular case, there are two > distinct features that different people want: the missing data > feature, and the masked array feature. The more I think about it, the > less I see how they can be combined into one dessert topping + floor > wax solution. Here are three particular points where they seem to > contradict each other: ... [some proposals] In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. 
Please do pitch in for things I may have missed or misunderstood:

###############################################
A alternative-NEP on masking and missing values
###############################################

The principle of this aNEP is to separate the APIs for masking and for missing
values, according to

* The current implementation of masked arrays
* Nathaniel Smith's proposal.

This discussion is only of the API, and not of the implementation.

**************
Initialization
**************

First, missing values can be set and be displayed as ``np.NA, NA``::

>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA
dtype::

>>> np.array([1.0, 2.0, np.NA, 7.0])
array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set and be displayed as ``np.MASKED, MASKED``::

>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
array([1., 2., MASKED, 7.], masked=True)

As the initialization is not ambiguous, this can be written without
``masked=True``::

>>> np.array([1.0, 2.0, np.MASKED, 7.0])
array([1., 2., MASKED, 7.], masked=True)

******
Ufuncs
******

By default, NA values propagate::

>>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
>>> np.sum(na_arr)
NA('float64')

unless the ``skipna`` flag is set::

>>> np.sum(na_arr, skipna=True)
10.0

By default, masking does not propagate::

>>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
>>> np.sum(masked_arr)
10.0

unless the ``propmsk`` flag is set::

>>> np.sum(masked_arr, propmsk=True)
MASKED

An array can be masked, and contain NA values::

>>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])

In the default case, the behavior is obvious::

>>> np.sum(both_arr)
NA('float64')

It's also obvious what to do with ``skipna=True``::

>>> np.sum(both_arr, skipna=True)
10.0
>>> np.sum(both_arr, skipna=True, propmsk=True)
MASKED

To break the tie between NA and MSK, NAs propagate harder::

>>> np.sum(both_arr, propmsk=True)
NA('float64')

**********
Assignment
**********

is obvious in the NA case::

>>> arr = np.array([1.0, 2.0, 7.0])
>>> arr[2] = np.NA
TypeError('dtype does not support NA')
>>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>>> na_arr[2] = np.NA
>>> na_arr
array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only
via the mask::

>>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>>> masked_arr[2] = np.NA
TypeError('dtype does not support NA')
>>> masked_arr[2] = np.MASKED
TypeError('float() argument must be a string or a number')
>>> masked_arr.visible[2] = False
>>> masked_arr
array([1., 2., MASKED], masked=True)

See y'all,

Matthew

From pgmdevlist at gmail.com  Thu Jun 30 09:58:06 2011
From: pgmdevlist at gmail.com (Pierre GM)
Date: Thu, 30 Jun 2011 15:58:06 +0200
Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2
In-Reply-To: 
References: 
Message-ID: 

On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote:
> ###############################################
> A alternative-NEP on masking and missing values
> ###############################################

I like the idea of two different special values, np.NA for missing values, np.IGNORE for masked values. np.NA values in an array define what was implemented in numpy.ma as a 'hard mask' (where you can't unmask data), while np.IGNOREs correspond to the .mask in numpy.ma. Looks fairly non ambiguous that way.
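For reference, the existing numpy.ma distinction being drawn on here looks roughly like this (a sketch; the printed form of masked arrays differs between numpy versions):

    import numpy as np

    soft = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
    soft[1] = 99.0    # assignment fills the value and unmasks it (the ordinary .mask)
    print(soft)       # [1.0 99.0 3.0]

    hard = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], hard_mask=True)
    hard[1] = 99.0    # silently ignored: a hard-masked entry cannot be unmasked
    print(hard)       # [1.0 -- 3.0]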
> ************** > Initialization > ************** > > First, missing values can be set and be displayed as ``np.NA, NA``:: > >>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') > array([1., 2., NA, 7.], dtype='NA[ > As the initialization is not ambiguous, this can be written without the NA > dtype:: > >>>> np.array([1.0, 2.0, np.NA, 7.0]) > array([1., 2., NA, 7.], dtype='NA[ > Masked values can be set and be displayed as ``np.MASKED, MASKED``:: > >>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) > array([1., 2., MASKED, 7.], masked=True) > > As the initialization is not ambiguous, this can be written without > ``masked=True``:: > >>>> np.array([1.0, 2.0, np.MASKED, 7.0]) > array([1., 2., MASKED, 7.], masked=True) I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here. > ****** > Ufuncs > ****** All fine. > > ********** > Assignment > ********** > > is obvious in the NA case:: > >>>> arr = np.array([1.0, 2.0, 7.0]) >>>> arr[2] = np.NA > TypeError('dtype does not support NA') >>>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]') >>>> na_arr[2] = np.NA >>>> na_arr > array([1., 2., NA], dtype='NA[ > Direct assignnent in the masked case is magic and confusing, and so happens only > via the mask:: > >>>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) >>>> masked_arr[2] = np.NA > TypeError('dtype does not support NA') >>>> masked_arr[2] = np.MASKED > TypeError('float() argument must be a string or a number') >>>> masked_arr.visible[2] = False >>>> masked_arr > array([1., 2., MASKED], masked=True) What about the reverse case ? When you assign a regular value to a np.NA/np.IGNORE item ? From charlesr.harris at gmail.com Thu Jun 30 10:17:04 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 30 Jun 2011 08:17:04 -0600 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett wrote: > Hi, > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: > > Anyway, it's pretty clear that in this particular case, there are two > > distinct features that different people want: the missing data > > feature, and the masked array feature. The more I think about it, the > > less I see how they can be combined into one dessert topping + floor > > wax solution. Here are three particular points where they seem to > > contradict each other: > ... > [some proposals] > > In the interest of making the discussion as concrete as possible, here > is my draft of an alternative proposal for NAs and masking, based on > Nathaniel's comments. Writing it, it seemed to me that Nathaniel is > right, that the ideas become much clearer when the NA idea and the > MASK idea are separate. Please do pitch in for things I may have > missed or misunderstood: > > ############################################### > A alternative-NEP on masking and missing values > ############################################### > > The principle of this aNEP is to separate the APIs for masking and for > missing > values, according to > > * The current implementation of masked arrays > * Nathaniel Smith's proposal. > > This discussion is only of the API, and not of the implementation. 
> > ************** > Initialization > ************** > > First, missing values can be set and be displayed as ``np.NA, NA``:: > > >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') > array([1., 2., NA, 7.], dtype='NA[ > As the initialization is not ambiguous, this can be written without the NA > dtype:: > > >>> np.array([1.0, 2.0, np.NA, 7.0]) > array([1., 2., NA, 7.], dtype='NA[ > Masked values can be set and be displayed as ``np.MASKED, MASKED``:: > > >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) > array([1., 2., MASKED, 7.], masked=True) > > As the initialization is not ambiguous, this can be written without > ``masked=True``:: > > >>> np.array([1.0, 2.0, np.MASKED, 7.0]) > array([1., 2., MASKED, 7.], masked=True) > > ****** > Ufuncs > ****** > > By default, NA values propagate:: > > >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0]) > >>> np.sum(na_arr) > NA('float64') > > unless the ``skipna`` flag is set:: > > >>> np.sum(na_arr, skipna=True) > 10.0 > > By default, masking does not propagate:: > > >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0]) > >>> np.sum(masked_arr) > 10.0 > > unless the ``propmsk`` flag is set:: > > >>> np.sum(masked_arr, propmsk=True) > MASKED > > An array can be masked, and contain NA values:: > > >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0]) > > In the default case, the behavior is obvious:: > > >>> np.sum(both_arr) > NA('float64') > > It's also obvious what to do with ``skipna=True``:: > > >>> np.sum(both_arr, skipna=True) > 10.0 > >>> np.sum(both_arr, skipna=True, propmsk=True) > MASKED > > To break the tie between NA and MSK, NAs propagate harder:: > > >>> np.sum(both_arr, propmsk=True) > NA('float64') > > ********** > Assignment > ********** > > is obvious in the NA case:: > > >>> arr = np.array([1.0, 2.0, 7.0]) > >>> arr[2] = np.NA > TypeError('dtype does not support NA') > >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]') > >>> na_arr[2] = np.NA > >>> na_arr > array([1., 2., NA], dtype='NA[ > Direct assignnent in the masked case is magic and confusing, and so happens > only > via the mask:: > > >>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) > >>> masked_arr[2] = np.NA > TypeError('dtype does not support NA') > >>> masked_arr[2] = np.MASKED > TypeError('float() argument must be a string or a number') > >>> masked_arr.visible[2] = False > >>> masked_arr > array([1., 2., MASKED], masked=True) > > See y'all, > > I honestly don't see the problem here. The difference isn't between masked_values/missing_values, it is between masked arrays and masked views of unmasked arrays. I think the view concept is central to what is going on. It may not be what folks are used to, but it strikes me as a clarifying advance rather than a mixed up confusion. Admittedly, it depends on the numpy centric ability to have views, but views are a wonderful thing. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From d.s.seljebotn at astro.uio.no Thu Jun 30 10:26:31 2011 From: d.s.seljebotn at astro.uio.no (Dag Sverre Seljebotn) Date: Thu, 30 Jun 2011 16:26:31 +0200 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: <4E0C8797.5050409@astro.uio.no> On 06/30/2011 04:17 PM, Charles R Harris wrote: > > > On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett > wrote: > > Hi, > > On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith > wrote: > > Anyway, it's pretty clear that in this particular case, there are two > > distinct features that different people want: the missing data > > feature, and the masked array feature. The more I think about it, the > > less I see how they can be combined into one dessert topping + floor > > wax solution. Here are three particular points where they seem to > > contradict each other: > ... > [some proposals] > > In the interest of making the discussion as concrete as possible, here > is my draft of an alternative proposal for NAs and masking, based on > Nathaniel's comments. Writing it, it seemed to me that Nathaniel is > right, that the ideas become much clearer when the NA idea and the > MASK idea are separate. Please do pitch in for things I may have > missed or misunderstood: > > ############################################### > A alternative-NEP on masking and missing values > ############################################### > > The principle of this aNEP is to separate the APIs for masking and > for missing > values, according to > > * The current implementation of masked arrays > * Nathaniel Smith's proposal. > > This discussion is only of the API, and not of the implementation. > > ************** > Initialization > ************** > > First, missing values can be set and be displayed as ``np.NA, NA``:: > > >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') > array([1., 2., NA, 7.], dtype='NA[ > As the initialization is not ambiguous, this can be written without > the NA > dtype:: > > >>> np.array([1.0, 2.0, np.NA, 7.0]) > array([1., 2., NA, 7.], dtype='NA[ > Masked values can be set and be displayed as ``np.MASKED, MASKED``:: > > >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) > array([1., 2., MASKED, 7.], masked=True) > > As the initialization is not ambiguous, this can be written without > ``masked=True``:: > > >>> np.array([1.0, 2.0, np.MASKED, 7.0]) > array([1., 2., MASKED, 7.], masked=True) > > ****** > Ufuncs > ****** > > By default, NA values propagate:: > > >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0]) > >>> np.sum(na_arr) > NA('float64') > > unless the ``skipna`` flag is set:: > > >>> np.sum(na_arr, skipna=True) > 10.0 > > By default, masking does not propagate:: > > >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0]) > >>> np.sum(masked_arr) > 10.0 > > unless the ``propmsk`` flag is set:: > > >>> np.sum(masked_arr, propmsk=True) > MASKED > > An array can be masked, and contain NA values:: > > >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0]) > > In the default case, the behavior is obvious:: > > >>> np.sum(both_arr) > NA('float64') > > It's also obvious what to do with ``skipna=True``:: > > >>> np.sum(both_arr, skipna=True) > 10.0 > >>> np.sum(both_arr, skipna=True, propmsk=True) > MASKED > > To break the tie between NA and MSK, NAs propagate harder:: > > >>> np.sum(both_arr, propmsk=True) > NA('float64') > > ********** > Assignment > ********** > > is obvious in the NA case:: > > >>> arr = np.array([1.0, 2.0, 7.0]) > >>> arr[2] = np.NA > TypeError('dtype does not support NA') > 
>>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]') > >>> na_arr[2] = np.NA > >>> na_arr > array([1., 2., NA], dtype='NA[ > Direct assignnent in the masked case is magic and confusing, and so > happens only > via the mask:: > > >>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) > >>> masked_arr[2] = np.NA > TypeError('dtype does not support NA') > >>> masked_arr[2] = np.MASKED > TypeError('float() argument must be a string or a number') > >>> masked_arr.visible[2] = False > >>> masked_arr > array([1., 2., MASKED], masked=True) > > See y'all, > > > I honestly don't see the problem here. The difference isn't between > masked_values/missing_values, it is between masked arrays and masked > views of unmasked arrays. I think the view concept is central to what is > going on. It may not be what folks are used to, but it strikes me as a > clarifying advance rather than a mixed up confusion. Admittedly, it > depends on the numpy centric ability to have views, but views are a > wonderful thing. So a) how do you propose that reductions behave?, b) what semantics for the []= operator do you propose? That would clarify why you don't see a problem.. Dag Sverre From charlesr.harris at gmail.com Thu Jun 30 10:27:43 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 30 Jun 2011 08:27:43 -0600 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: On Thu, Jun 30, 2011 at 8:17 AM, Charles R Harris wrote: > > > On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett wrote: > >> Hi, >> >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote: >> > Anyway, it's pretty clear that in this particular case, there are two >> > distinct features that different people want: the missing data >> > feature, and the masked array feature. The more I think about it, the >> > less I see how they can be combined into one dessert topping + floor >> > wax solution. Here are three particular points where they seem to >> > contradict each other: >> ... >> [some proposals] >> >> In the interest of making the discussion as concrete as possible, here >> is my draft of an alternative proposal for NAs and masking, based on >> Nathaniel's comments. Writing it, it seemed to me that Nathaniel is >> right, that the ideas become much clearer when the NA idea and the >> MASK idea are separate. Please do pitch in for things I may have >> missed or misunderstood: >> >> ############################################### >> A alternative-NEP on masking and missing values >> ############################################### >> >> The principle of this aNEP is to separate the APIs for masking and for >> missing >> values, according to >> >> * The current implementation of masked arrays >> * Nathaniel Smith's proposal. >> >> This discussion is only of the API, and not of the implementation. 
>> >> ************** >> Initialization >> ************** >> >> First, missing values can be set and be displayed as ``np.NA, NA``:: >> >> >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') >> array([1., 2., NA, 7.], dtype='NA[> >> As the initialization is not ambiguous, this can be written without the NA >> dtype:: >> >> >>> np.array([1.0, 2.0, np.NA, 7.0]) >> array([1., 2., NA, 7.], dtype='NA[> >> Masked values can be set and be displayed as ``np.MASKED, MASKED``:: >> >> >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) >> array([1., 2., MASKED, 7.], masked=True) >> >> As the initialization is not ambiguous, this can be written without >> ``masked=True``:: >> >> >>> np.array([1.0, 2.0, np.MASKED, 7.0]) >> array([1., 2., MASKED, 7.], masked=True) >> >> ****** >> Ufuncs >> ****** >> >> By default, NA values propagate:: >> >> >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0]) >> >>> np.sum(na_arr) >> NA('float64') >> >> unless the ``skipna`` flag is set:: >> >> >>> np.sum(na_arr, skipna=True) >> 10.0 >> >> By default, masking does not propagate:: >> >> >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0]) >> >>> np.sum(masked_arr) >> 10.0 >> >> unless the ``propmsk`` flag is set:: >> >> >>> np.sum(masked_arr, propmsk=True) >> MASKED >> >> An array can be masked, and contain NA values:: >> >> >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0]) >> >> In the default case, the behavior is obvious:: >> >> >>> np.sum(both_arr) >> NA('float64') >> >> It's also obvious what to do with ``skipna=True``:: >> >> >>> np.sum(both_arr, skipna=True) >> 10.0 >> >>> np.sum(both_arr, skipna=True, propmsk=True) >> MASKED >> >> To break the tie between NA and MSK, NAs propagate harder:: >> >> >>> np.sum(both_arr, propmsk=True) >> NA('float64') >> >> ********** >> Assignment >> ********** >> >> is obvious in the NA case:: >> >> >>> arr = np.array([1.0, 2.0, 7.0]) >> >>> arr[2] = np.NA >> TypeError('dtype does not support NA') >> >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]') >> >>> na_arr[2] = np.NA >> >>> na_arr >> array([1., 2., NA], dtype='NA[> >> Direct assignnent in the masked case is magic and confusing, and so >> happens only >> via the mask:: >> >> >>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) >> >>> masked_arr[2] = np.NA >> TypeError('dtype does not support NA') >> >>> masked_arr[2] = np.MASKED >> TypeError('float() argument must be a string or a number') >> >>> masked_arr.visible[2] = False >> >>> masked_arr >> array([1., 2., MASKED], masked=True) >> >> See y'all, >> >> > I honestly don't see the problem here. The difference isn't between > masked_values/missing_values, it is between masked arrays and masked views > of unmasked arrays. I think the view concept is central to what is going on. > It may not be what folks are used to, but it strikes me as a clarifying > advance rather than a mixed up confusion. Admittedly, it depends on the > numpy centric ability to have views, but views are a wonderful thing. > > OK, I can see a problem in that currently the only way to unmask a value is by assignment of a valid value to the underlying data array, that is the missing data idea. For masked data, it might be convenient to have something that only affected the mask instead of having to take another view of the unmasked data and reconstructing the mask with some modifications. So that could maybe be done with a "soft" np.CLEAR that only worked on views of unmasked arrays. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Thu Jun 30 10:50:53 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 09:50:53 -0500 Subject: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray In-Reply-To: <87ei2c8o93.fsf@ginnungagap.bsc.es> References: <6162FE3A-5F43-4B1A-8F10-2810F7C1DBE4@gmail.com> <87vcvqt9s0.fsf@ginnungagap.bsc.es> <87ei2c8o93.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 1:51 PM, Llu?s wrote: > Mark Wiebe writes: > [...] > > I think that deciding on the value of NA signal values boils down to > > this question: should 3rd party code be able to interpret missing > data > > information stored in the separate mask array? > > > I'm tossing around some variations of ideas using the iterator to > > provide a buffered mask-based interface that works uniformly with both > > masked arrays and NA dtypes. This way 3rd party C code only needs to > > implement one missing data mechanism to fully support both of NumPy's > > missing data mechanisms. > > Nice. If non-numpy C code is bound to see it as an array (i.e., _always_ > oblivious to the mask concept), then you should probably do what I said > about "(un)merging" the bit pattern and mask-based NAs, but in this case > can be done on each block given by the iteration window. > My hands are a little bit tied because of ABI compatibility, but I'm thinking of ways I can cause 3rd party C code to fail if it doesn't ask for the data with the mask when it's masked. There's still the possibility of giving a finer granularity interface > where both are explicitly accessed, but this will probably add yet > another set of API functions (although the merging interface can be > implemented on top of this explicit raw iteration interface). > Things should be as simple as possible, but having layers of lower level stuff and higher level stuff is good. This is why, for instance, I introduced the where= parameter to ufuncs, because it's another useful way of using the same low-level mechanisms. BTW, this has some overlapping with a mail Travis sent long ago about > dynamically filling the backing byffer contents (in this case with the > "merged" NA data for 3rd parties). > > It might prove completely unsatisfactory (w.r.t. performance), but you > could also fake a bit-pattern-only sequential array by using mprotect to > detect the memory accesses and trigger then the production of the merged > data. This provides means for code using the simple buffer protocol, > without duplicating the whole structure for NA merges. > > This can be complicated even more with some simple strided pattern > detection to diminish the number of segfaults, as the shape is known. > Someone else will have to do stuff like this... ;) -Mark > > > Lluis > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mwwiebe at gmail.com Thu Jun 30 10:52:52 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 09:52:52 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0B69F1.9000805@astro.uio.no> References: <4E0B3822.5010908@astro.uio.no> <4E0B69F1.9000805@astro.uio.no> Message-ID: On Wed, Jun 29, 2011 at 1:07 PM, Dag Sverre Seljebotn < d.s.seljebotn at astro.uio.no> wrote: > On 06/29/2011 07:38 PM, Mark Wiebe wrote: > > On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn > > > wrote: > > > > On 06/29/2011 03:45 PM, Matthew Brett wrote: > > > Hi, > > > > > > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe > > wrote: > > >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew > > Brett> > > >> wrote: > > >>> > > >>> Hi, > > >>> > > >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith > > wrote: > > >>> ... > > >>>> (You might think, what difference does it make if you *can* > > unmask an > > >>>> item? Us missing data folks could just ignore this feature. > But: > > >>>> whatever we end up implementing is something that I will have > to > > >>>> explain over and over to different people, most of them not > > >>>> particularly sophisticated programmers. And there's just no > > sensible > > >>>> way to explain this idea that if you store some particular > > value, then > > >>>> it replaces the old value, but if you store NA, then the old > > value is > > >>>> still there. > > >>> > > >>> Ouch - yes. No question, that is difficult to explain. Well, > I > > >>> think the explanation might go like this: > > >>> > > >>> "Ah, yes, well, that's because in fact numpy records missing > > values by > > >>> using a 'mask'. So when you say `a[3] = np.NA', what you mean > is, > > >>> 'a._mask = np.ones(a.shape, np.dtype(bool); a._mask[3] = False`" > > >>> > > >>> Is that fair? > > >> > > >> My favorite way of explaining it would be to have a grid of > > numbers written > > >> on paper, then have several cardboards with holes poked in them > > in different > > >> configurations. Placing these cardboard masks in front of the > > grid would > > >> show different sets of non-missing data, without affecting the > > values stored > > >> on the paper behind them. > > > > > > Right - but here of course you are trying to explain the mask, and > > > this is Nathaniel's point, that in order to explain NAs, you have > to > > > explain masks, and so, even at a basic level, the fusion of the > two > > > ideas is obvious, and already confusing. I mean this: > > > > > > a[3] = np.NA > > > > > > "Oh, so you just set the a[3] value to have some missing value > code?" > > > > > > "Ah - no - in fact what I did was set a associated mask in > position > > > a[3] so that you can't any longer see the previous value of a[3]" > > > > > > "Huh. You mean I have a mask for every single value in order to > be > > > able to blank out a[3]? It looks like an assignment. I mean, it > > > looks just like a[3] = 4. But I guess it isn't?" > > > > > > "Er..." > > > > > > I think Nathaniel's point is a very good one - these are separate > > > ideas, np.NA and np.IGNORE, and a joint implementation is bound to > > > draw them together in the mind of the user. Apart from anything > > > else, the user has to know that, if they want a single NA value in > an > > > array, they have to add a mask size array.shape in bytes. They > have > > > to know then, that NA is implemented by masking, and then the 'NA > for > > > free by adding masking' idea breaks down and starts to feel like a > > > kludge. 
> > > > > > The counter argument is of course that, in time, the > > implementation of > > > NA with masking will seem as obvious and intuitive, as, say, > > > broadcasting, and that we are just reacting from lack of > experience > > > with the new API. > > > > However, no matter how used we get to this, people coming from almost > > any other tool (in particular R) will keep think it is > > counter-intuitive. Why set up a major semantic incompatability that > > people then have to overcome in order to start using NumPy. > > > > > > I'm not aware of a semantic incompatibility. I believe R doesn't support > > views like NumPy does, so the things you have to do to see masking > > semantics aren't even possible in R. > > Well, whether the same feature is possible or not in R is irrelevant to > whether a semantic incompatability would exist. > > Views themselves are a *major* semantic incompatability, and are highly > confusing at first to MATLAB/Fortran/R people. However they have major > advantages outweighing the disadvantage of having to caution new users. > > But there's simply no precedence anywhere for an assignment that doesn't > erase the old value for a particular input value, and the advantages > seem pretty minor (well, I think it is ugly in its own right, but that > is besides the point...) > I disagree that there's no precedent, but maybe there isn't something which is exactly the same as my design. The whole "actual real literal assignment" thought process leads to considerations of little gnomes writing numbers on pieces of paper inside your computer... -Mark > > Dag Sverre > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 11:06:24 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 10:06:24 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 2:32 PM, Matthew Brett wrote: > Hi, > > On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe wrote: > > On Wed, Jun 29, 2011 at 8:20 AM, Llu?s wrote: > >> > >> Matthew Brett writes: > >> > >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys > >> >> the idea that the entry is still there, but we're just ignoring it. > Of > >> >> course, that goes against common convention, but it might be easier > to > >> >> explain. > >> > >> > I think Nathaniel's point is that np.IGNORE is a different idea than > >> > np.NA, and that is why joining the implementations can lead to > >> > conceptual confusion. 
> >> > >> This is how I see it: > >> > >> >>> a = np.array([0, 1, 2], dtype=int) > >> >>> a[0] = np.NA > >> ValueError > >> >>> e = np.array([np.NA, 1, 2], dtype=int) > >> ValueError > >> >>> b = np.array([np.NA, 1, 2], dtype=np.maybe(int)) > >> >>> m = np.array([np.NA, 1, 2], dtype=int, masked=True) > >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True) > >> >>> b[1] = np.NA > >> >>> np.sum(b) > >> np.NA > >> >>> np.sum(b, skipna=True) > >> 2 > >> >>> b.mask > >> None > >> >>> m[1] = np.NA > >> >>> np.sum(m) > >> 2 > >> >>> np.sum(m, skipna=True) > >> 2 > >> >>> m.mask > >> [False, False, True] > >> >>> bm[1] = np.NA > >> >>> np.sum(bm) > >> 2 > >> >>> np.sum(bm, skipna=True) > >> 2 > >> >>> bm.mask > >> [False, False, True] > >> > >> So: > >> > >> * Mask takes precedence over bit pattern on element assignment. There's > >> still the question of how to assign a bit pattern NA when the mask is > >> active. > >> > >> * When using mask, elements are automagically skipped. > >> > >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False" > >> > >> * When using bit pattern + mask, it might make sense to have the initial > >> values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True, > >> False, True]" and "np.sum(bm) == np.NA") > > > > There seems to be a general idea that masks and NA bit patterns imply > > particular differing semantics, something which I think is simply false. > > Well - first - it's helpful surely to separate the concepts and the > implementation. > > Concepts / use patterns (as delineated by Nathaniel): > A) missing values == 'np.NA' in my emails. Can we call that CMV > (concept missing values)? > B) masks == np.IGNORE in my emails . CMSK (concept masks)? > This is a different conceptual model than I'm proposing in the NEP. This is also exactly what I was trying to clarify in the first email in this thread under the headings "Missing Data Abstraction" and "Implementation Techniques". Masks are *just* an implementation technique. They imply nothing more, except through previously established conventions such as in various bitmasks, image masks, numpy.ma and others. masks != np.IGNORE bit patterns != np.NA Masks vs bit patterns and R's default NA vs rm.na NA semantics are completely independent, except where design choices are made that they should be related. I think they should be unrelated, masks and bit patterns are two approaches to solving the same problem. > > Implementations > 1) bit-pattern == na-dtype - how about we call that IBP > (implementation bit patten)? > 2) array.mask. IM (implementation mask)? > > Nathaniel implied that: > > CMV implies: sum([np.NA, 1]) == np.NA > CMSK implies sum([np.NA, 1]) == 1 > > and indeed, that's how R and masked arrays respectively behave. R and numpy.ma. If we're trying to be clear about our concepts and implementations, numpy.ma is just one possible implementation of masked arrays. > So I > think it's reasonable to say that at least R thought that the bitmask > implied the first and Pierre and others thought the mask meant the > second. > R's model is based on years of experience and a model of what missing values implies, the bitmask implies nothing about the behavior of NA. > > The NEP as it stands thinks of CMV and and CM as being different views > of the same thing, Please correct me if I'm wrong. > > > Both NaN and Inf are implemented in hardware with the same idea as the NA > > bit pattern, but they do not follow NA missing value semantics. 
> > Right - and that doesn't affect the argument, because the argument is > about the concepts and not the implementation. > You just said R thought bitmasks implied something, and you're saying masked arrays imply something. If the argument is just about the missing value concepts, neither of these should be in the present discussion. > > > As far as I can tell, the only required difference between them is that > NA > > bit patterns must destroy the data. Nothing else. > > I think Nathaniel's point was about the expected default behavior in > the different concepts. > > > Everything on top of that > > is a choice of API and interface mechanisms. I want them to behave > exactly > > the same except for that necessary difference, so that it will be > possible > > to use the *exact same Python code* with either approach. > > Right. And Nathaniel's point is that that desire leads to fusion of > the two ideas into one when they should be separated. For example, if > I understand correctly: > > >>> a = np.array([1.0, 2.0, 3, 7.0], masked=True) > >>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') > >>> a[3] = np.NA # actual real hand-on-heart assignment > >>> b[3] = np.NA # magic mask setting although it looks the same > Why is one "magic" and the other "real"? All of this is already sitting on 100 layers of abstraction above electrons and atoms. If we're talking about "real," maybe we should be programming in machine code or using breadboards with individual transistors. > > > Say you're using NA dtypes, and suddenly you think, "what if I > temporarily > > treated these as NA too". Now you have to copy your whole array to avoid > > destroying your data! The NA bit pattern didn't save you memory here... > Say > > you're using masks, and it turns out you didn't actually need masking > > semantics. If they're different, you now have to do lots of code changes > to > > switch to NA dtypes! > > I personally have not run across that case. I'd imagine that, if you > knew you wanted to do something so explicitly masking-like, you'd > start with the masking interface. > People's use cases change over time, and sometimes one person's code is useful for others. I'd prefer to let people share. Clearly there are some overlaps between what masked arrays are trying > to achieve and what Rs NA mechanisms are trying to achieve. Are they > really similar enough that they should function using the same API? > Yes. > And if so, won't that be confusing? No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already confusing. -Mark > I think that's the question > that's being asked. > > See you, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 11:07:58 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 10:07:58 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <877h84a48m.fsf@ginnungagap.bsc.es> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 1:20 PM, Llu?s wrote: > Mark Wiebe writes: > > > There seems to be a general idea that masks and NA bit patterns imply > > particular differing semantics, something which I think is simply > > false. 
> > Well, my example contained a difference (the need for the "skipna=True" > argument) precisely because it seemed that there was some need for > different defaults. > > Honestly, I think this difference breaks the POLA (principle of least > astonishment). > > > [...] > > As far as I can tell, the only required difference between them is > > that NA bit patterns must destroy the data. Nothing else. Everything > > on top of that is a choice of API and interface mechanisms. I want > > them to behave exactly the same except for that necessary difference, > > so that it will be possible to use the *exact same Python code* with > > either approach. > > I completely agree. What I'd suggest is a global and/or per-object > "ndarray.flags.skipna" for people like me that just want to ignore these > entries without caring about setting it on each operaion (or the other > way around, depends on the default behaviour). > > The downside is that it adds yet another tweaking knob, which is not > desirable... > One way around this would be to create an ndarray subclass which changes that default. Currently this would not be possible to do nicely, but with the _numpy_ufunc_ idea I proposed in a separate thread a while back, this could work. -Mark > > > Lluis > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 11:11:10 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 10:11:10 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0B9743.10608@hawaii.edu> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <4E0B9743.10608@hawaii.edu> Message-ID: On Wed, Jun 29, 2011 at 4:21 PM, Eric Firing wrote: > On 06/29/2011 09:32 AM, Matthew Brett wrote: > > Hi, > > > [...] > > > > Clearly there are some overlaps between what masked arrays are trying > > to achieve and what Rs NA mechanisms are trying to achieve. Are they > > really similar enough that they should function using the same API? > > And if so, won't that be confusing? I think that's the question > > that's being asked. > > And I think the answer is "no". No more confusing to people coming from > R to numpy than views already are--with or without the NEP--and not > *requiring* people to use any NA-related functionality beyond what they > are used to from R. > > My understanding of the NEP is that it directly yields an API closely > matching that of R, but with the opportunity, via views, to do more with > less work, if one so desires. The present masked array module could be > made more efficient if the NEP is implemented; regardless of whether > this is done, the masked array module is not about to vanish, so anyone > wanting precisely the masked array API will have it; and others remain > free to ignore it (except for those of us involved in developing > libraries such as matplotlib, which will have to support all variations > of the new API along with the already-supported masked arrays). > > In addition, for new code, the full-blown masked array module may not be > needed. 
A convenience it adds, however, is the automatic masking of > invalid values: > > In [1]: np.ma.log(-1) > Out[1]: masked > > I'm sure this horrifies some, but there are times and places where it is > a genuine convenience, and preferable to having to use a separate > operation to replace nan or inf with NA or whatever it ends up being. > I added a mechanism to support this idea with the NA dtypes approach, spelled 'NA[f8,InfNan]'. Here, all Infs and NaNs are treated as NA by the system. -Mark If np.seterr were extended to allow such automatic masking as an option, > then the need for a separate masked array module would shrink further. > I wouldn't mind having to use an explicit kwarg for ignoring NA in > reduction methods. > > Eric > > > > > > See you, > > > > Matthew > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 11:15:45 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 10:15:45 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> <87wrg471vt.fsf@ginnungagap.bsc.es> Message-ID: On Wed, Jun 29, 2011 at 5:42 PM, Nathaniel Smith wrote: > On Wed, Jun 29, 2011 at 2:40 PM, Llu?s wrote: > > I'm for the option of having a single API when you want to have NA > > elements, regardless of whether it's using masks or bit patterns. > > I understand the desire to avoid having two different APIS... > > [snip] > > My concern is now about how to set the "skipna" in a "comfortable" way, > > so that I don't have to set it again and again as ufunc arguments: > > > >>>> a > > array([NA, 2, 3]) > >>>> b > > array([1, 2, NA]) > >>>> a + b > > array([NA, 2, NA]) > >>>> a.flags.skipna=True > >>>> b.flags.skipna=True > >>>> a + b > > array([1, 4, 3]) > > ...But... now you're introducing two different kinds of arrays with > different APIs again? Ones where .skipna==True, and ones where > .skipna==False? > > I know that this way it's not keyed on the underlying storage format, > but if we support both bit patterns and mask arrays at the > implementation level, then the only way to make them have identical > APIs is if we completely disallow unmasking, and shared masks, and so > forth. The right set of these conditions has been in the NEP from the beginning. Unmasking without value assignment is disallowed - the only way to "see behind the mask" or to share masks is with views. My impression is than more people are concerned with sharing the same data between different masks, something also supported through views. -Mark > Which doesn't seem like it'd be very popular (and would make > including the mask-based implementation pretty pointless). So I think > we have to assume that they will have APIs that are at least somewhat > different. And then it seems like with this proposal then we'd > actually end up with *4* different APIs that any particular array > might follow... (or maybe more, depending on how arrays that had both > a bit-pattern and mask ended up working). 
> > That's why I was thinking the best solution might be to just bite the > bullet and make the APIs *totally* different and non-overlapping, so > it was always obvious which you were using and how they'd interact. > But I don't know -- for my work I'd be happy to just pass skipna > everywhere I needed it, and never unmask anything, and so forth, so > maybe there's some reason why it's really important for the > bit-pattern NA API to overlap more with the masked array API? > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 11:18:01 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 10:18:01 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0C1C5F.5060909@noaa.gov> References: <4E0C1C5F.5060909@noaa.gov> Message-ID: On Thu, Jun 30, 2011 at 1:49 AM, Chris Barker wrote: > On 6/27/11 9:53 AM, Charles R Harris wrote: > > Some discussion of disk storage might also help. I don't see how the > > rules can be enforced if two files are used, one for the mask and > > another for the data, but that may just be something we need to live > with. > > It seems it wouldn't be too big deal to extend the *.npy format to > include the mask. > > Could one memmap both the data array and the mask? > This I haven't thought about too much yet, but I don't see why not. This does provide a back door into the mask which violates the abstractions, so I would want it to be an extremely narrow special case. -Mark > > Netcdf (and assume hdf) have ways to support masks as well. > > -Chris > > > > > -- > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Thu Jun 30 11:38:55 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 16:38:55 +0100 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Thu, Jun 30, 2011 at 2:58 PM, Pierre GM wrote: > > On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote: >> ############################################### >> A alternative-NEP on masking and missing values >> ############################################### > > I like the idea of two different special values, np.NA for missing values, np.IGNORE for masked values. np.NA values in an array define what was implemented in numpy.ma as a 'hard mask' (where you can't unmask data), while np.IGNOREs correspond to the .mask in numpy.ma. Looks fairly non ambiguous that way. > > >> ************** >> Initialization >> ************** >> >> First, missing values can be set and be displayed as ``np.NA, NA``:: >> >>>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') >> ? ?array([1., 2., NA, 7.], dtype='NA[> >> As the initialization is not ambiguous, this can be written without the NA >> dtype:: >> >>>>> np.array([1.0, 2.0, np.NA, 7.0]) >> ? 
?array([1., 2., NA, 7.], dtype='NA[> >> Masked values can be set and be displayed as ``np.MASKED, MASKED``:: >> >>>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) >> ? ?array([1., 2., MASKED, 7.], masked=True) >> >> As the initialization is not ambiguous, this can be written without >> ``masked=True``:: >> >>>>> np.array([1.0, 2.0, np.MASKED, 7.0]) >> ? ?array([1., 2., MASKED, 7.], masked=True) > > I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here. If I put np.MASKED (I agree I prefer np.IGNORE) in the init, then obviously I mean it should be masked, so the 'masked=True' here is completely redundant, yes, I agree. And in fact: np.array([1.0, 2.0, np.MASKED, 7.0], masked=False) should raise an error. On the other hand, if I make a normal array: arr = np.array([1.0, 2.0, 7.0]) and then do this: arr.visible[2] = False then either I should raise an error (it's not a masked array), or, more magically, construct a mask on the fly. This somewhat breaks expectations though, because you might just have made a largish mask array without having any clue that that had happened. > >> ****** >> Ufuncs >> ****** > > All fine. >> >> ********** >> Assignment >> ********** >> >> is obvious in the NA case:: >> >>>>> arr = np.array([1.0, 2.0, 7.0]) >>>>> arr[2] = np.NA >> ? ?TypeError('dtype does not support NA') >>>>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]') >>>>> na_arr[2] = np.NA >>>>> na_arr >> ? ?array([1., 2., NA], dtype='NA[ > OK > > >> >> Direct assignnent in the masked case is magic and confusing, and so happens only >> via the mask:: >> >>>>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) >>>>> masked_arr[2] = np.NA >> ? ?TypeError('dtype does not support NA') >>>>> masked_arr[2] = np.MASKED >> ? ?TypeError('float() argument must be a string or a number') >>>>> masked_arr.visible[2] = False >>>>> masked_arr >> ? ?array([1., 2., MASKED], masked=True) > > What about the reverse case ? When you assign a regular value to a np.NA/np.IGNORE item ? Well, for the np.NA case, this is straightforward: na_arr[2] = 3 It's just assignment. For ``masked_array[2] = 3`` - I don't know, I guess whatever we are used to. What do you think? Best, Matthew From pgmdevlist at gmail.com Thu Jun 30 12:03:30 2011 From: pgmdevlist at gmail.com (Pierre GM) Date: Thu, 30 Jun 2011 18:03:30 +0200 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: <277B3D4F-5DA1-4DFA-8666-729E9382D128@gmail.com> On Jun 30, 2011, at 5:38 PM, Matthew Brett wrote: > Hi, > > On Thu, Jun 30, 2011 at 2:58 PM, Pierre GM wrote: >> >> On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote: >>> ############################################### >>> A alternative-NEP on masking and missing values >>> ############################################### >> >> I like the idea of two different special values, np.NA for missing values, np.IGNORE for masked values. np.NA values in an array define what was implemented in numpy.ma as a 'hard mask' (where you can't unmask data), while np.IGNOREs correspond to the .mask in numpy.ma. Looks fairly non ambiguous that way. 
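For anyone not fluent in numpy.ma, the hard/soft distinction Pierre is pointing at already exists in the current masked array module (existing behaviour shown below, not the proposed np.NA/np.IGNORE API):

    >>> import numpy as np
    >>> soft = np.ma.array([1.0, 2.0, 3.0], mask=[0, 1, 0])   # soft mask (the default)
    >>> soft[1] = 5.0   # assignment fills the value and clears that mask entry
    >>> soft[1]         # 5.0 -- roughly what np.IGNORE would allow
    >>> hard = np.ma.array([1.0, 2.0, 3.0], mask=[0, 1, 0], hard_mask=True)
    >>> hard[1] = 5.0   # with a hard mask the assignment is silently discarded
    >>> hard[1]         # masked -- roughly the "can't unmask" behaviour wanted for np.NA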
>> >> >>> ************** >>> Initialization >>> ************** >>> >>> First, missing values can be set and be displayed as ``np.NA, NA``:: >>> >>>>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') >>> array([1., 2., NA, 7.], dtype='NA[>> >>> As the initialization is not ambiguous, this can be written without the NA >>> dtype:: >>> >>>>>> np.array([1.0, 2.0, np.NA, 7.0]) >>> array([1., 2., NA, 7.], dtype='NA[>> >>> Masked values can be set and be displayed as ``np.MASKED, MASKED``:: >>> >>>>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) >>> array([1., 2., MASKED, 7.], masked=True) >>> >>> As the initialization is not ambiguous, this can be written without >>> ``masked=True``:: >>> >>>>>> np.array([1.0, 2.0, np.MASKED, 7.0]) >>> array([1., 2., MASKED, 7.], masked=True) >> >> I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here. > > If I put np.MASKED (I agree I prefer np.IGNORE) in the init, then > obviously I mean it should be masked, so the 'masked=True' here is > completely redundant, yes, I agree. And in fact: > > np.array([1.0, 2.0, np.MASKED, 7.0], masked=False) > > should raise an error. On the other hand, if I make a normal array: > > arr = np.array([1.0, 2.0, 7.0]) > > and then do this: > > arr.visible[2] = False > > then either I should raise an error (it's not a masked array), or, > more magically, construct a mask on the fly. This somewhat breaks > expectations though, because you might just have made a largish mask > array without having any clue that that had happened. Well, I'd expect an error to be raised when assigning a NA if the initial array is not NA friendly. The 'magical' creation of a mask would be nice, but is probably too magic and best left alone. >> >>> >>> Direct assignnent in the masked case is magic and confusing, and so happens only >>> via the mask:: >>> >>>>>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) >>>>>> masked_arr[2] = np.NA >>> TypeError('dtype does not support NA') >>>>>> masked_arr[2] = np.MASKED >>> TypeError('float() argument must be a string or a number') >>>>>> masked_arr.visible[2] = False >>>>>> masked_arr >>> array([1., 2., MASKED], masked=True) >> >> What about the reverse case ? When you assign a regular value to a np.NA/np.IGNORE item ? > > Well, for the np.NA case, this is straightforward: > > na_arr[2] = 3 > > It's just assignment. For ``masked_array[2] = 3`` - I don't know, I > guess whatever we are used to. What do you think? Ahah, that depends. With a = np.array([1., np.NA, 3.]), then a[1]=2. should raise an error, as Mark suggests: you can't "unmask" a missing value, you need to create a view of the initial array then "unmask". It's the equivalent of a hard mask. With a = np.array([1., np.IGNORE, 3.]), then a[1]=2. should give np.array([1.,2.,3.]) and a.mask=[False,False,False]. That's a soft mask. At least, that's how I see it... P. From strang at nmr.mgh.harvard.edu Thu Jun 30 12:04:12 2011 From: strang at nmr.mgh.harvard.edu (Gary Strangman) Date: Thu, 30 Jun 2011 12:04:12 -0400 (EDT) Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: > Clearly there are some overlaps between what masked arrays are > trying to achieve and what Rs NA mechanisms are trying to achieve. > ?Are they really similar enough that they should function using > the same API? > > Yes. 
> > And if so, won't that be confusing? > > No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already > confusing. As one who's been silently following (most of) this thread, and a heavy R and numpy user, perhaps I should chime in briefly here with a use case. I more-or-less always work with partially masked data, like Matthew, but not numpy masked arrays because the memory overhead is prohibitive. And, sad to say, my experiments don't always go perfectly. I therefore have arrays in which there is /both/ (1) data that is simply missing (np.NA?)--it never had a value and never will--as well as simultaneously (2) data that that is temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask different portions for different purposes/analyses. I consider these two separate, completely independent issues and I unfortunately currently have to kluge a lot to handle this. Concretely, consider a list of 100,000 observations (rows), with 12 measures per observation-row (a 100,000 x 12 array). Every now and then, sprinkled throughout this array, I have missing values (someone didn't answer a question, or a computer failed to record a response, or whatever). For some analyses I want to mask the whole row (e.g., complete-case analysis), leaving me with array entries that should be tagged with all 4 possible labels: 1) not masked, not missing 2) masked, not missing 3) not masked, missing 4) masked, missing Obviously #4 is "overkill" ... but only until I want to unmask that row. At that point, I need to be sure that missing values remain missing when unmasked. Can a single API really handle this? -best Gary The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail. From xscript at gmx.net Thu Jun 30 12:12:33 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Thu, 30 Jun 2011 18:12:33 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: (Mark Wiebe's message of "Thu, 30 Jun 2011 10:07:58 -0500") References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <877h84a48m.fsf@ginnungagap.bsc.es> Message-ID: <87zkkzti1a.fsf@ginnungagap.bsc.es> Mark Wiebe writes: > On Wed, Jun 29, 2011 at 1:20 PM, Llu?s wrote: > [...] >> As far as I can tell, the only required difference between them is >> that NA bit patterns must destroy the data. Nothing else. Everything >> on top of that is a choice of API and interface mechanisms. I want >> them to behave exactly the same except for that necessary difference, >> so that it will be possible to use the *exact same Python code* with >> either approach. > I completely agree. What I'd suggest is a global and/or per-object > "ndarray.flags.skipna" for people like me that just want to ignore these > entries without caring about setting it on each operaion (or the other > way around, depends on the default behaviour). > The downside is that it adds yet another tweaking knob, which is not > desirable... > One way around this would be to create an ndarray subclass which > changes that default. Currently this would not be possible to do > nicely, but with the _numpy_ufunc_ idea I proposed in a separate > thread a while back, this could work. 
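The hook Mark mentions eventually became __array_ufunc__ in NumPy 1.13, so the idea can at least be sketched today. In the toy below, NaN stands in for NA (a real skipna= ufunc argument never shipped), and SkipNAArray is a made-up name, not anything in NumPy:

    import numpy as np

    class SkipNAArray(np.ndarray):
        # Toy subclass whose reductions behave as if skipna=True were the default.
        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            raw = [np.asarray(x) for x in inputs]    # drop the subclass, avoid recursion
            if method == 'reduce' and raw[0].dtype.kind == 'f':
                raw[0] = raw[0][~np.isnan(raw[0])]   # skip the NaN stand-ins
            return getattr(ufunc, method)(*raw, **kwargs)

    a = np.array([1.0, np.nan, 3.0]).view(SkipNAArray)
    print(np.add.reduce(a))    # 4.0 rather than nan
    print(np.add(a, 1.0))      # element-wise ops are left alone and still propagate

Nothing here decides the bit-pattern versus mask question; it only shows that a per-type default is expressible once ufuncs can be intercepted.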
That does indeed sound good :) Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From mwwiebe at gmail.com Thu Jun 30 12:13:00 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 11:13:00 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman wrote: > > Clearly there are some overlaps between what masked arrays are >> trying to achieve and what Rs NA mechanisms are trying to achieve. >> Are they really similar enough that they should function using >> the same API? >> >> Yes. >> >> And if so, won't that be confusing? >> >> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are >> already >> confusing. >> > > As one who's been silently following (most of) this thread, and a heavy R > and numpy user, perhaps I should chime in briefly here with a use case. I > more-or-less always work with partially masked data, like Matthew, but not > numpy masked arrays because the memory overhead is prohibitive. And, sad to > say, my experiments don't always go perfectly. I therefore have arrays in > which there is /both/ (1) data that is simply missing (np.NA?)--it never had > a value and never will--as well as simultaneously (2) data that that is > temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask > different portions for different purposes/analyses. I consider these two > separate, completely independent issues and I unfortunately currently have > to kluge a lot to handle this. > > Concretely, consider a list of 100,000 observations (rows), with 12 > measures per observation-row (a 100,000 x 12 array). Every now and then, > sprinkled throughout this array, I have missing values (someone didn't > answer a question, or a computer failed to record a response, or whatever). > For some analyses I want to mask the whole row (e.g., complete-case > analysis), leaving me with array entries that should be tagged with all 4 > possible labels: > > 1) not masked, not missing > 2) masked, not missing > 3) not masked, missing > 4) masked, missing > > Obviously #4 is "overkill" ... but only until I want to unmask that row. At > that point, I need to be sure that missing values remain missing when > unmasked. Can a single API really handle this? > The single API does support a masked array with an NA dtype, and the behavior in this case will be that the value is considered NA if either it is masked or the value is the NA bit pattern. So you could add a mask to an array with an NA dtype to temporarily treat the data as if more values were missing. One important reason I'm doing it this way is so that each NumPy algorithm and any 3rd party code only needs to be updated once to support both forms of missing data. The C API with masks is also a lot cleaner to work with than one for NA dtypes with the ability to have different NA bit patterns. -Mark > > -best > Gary > > > The information in this e-mail is intended only for the person to whom it > is > addressed. If you believe this e-mail was sent to you in error and the > e-mail > contains patient information, please contact the Partners Compliance > HelpLine at > http://www.partners.org/**complianceline. 
If the e-mail was sent to you in error > but does not contain patient information, please contact the sender and > properly > dispose of the e-mail. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Thu Jun 30 12:30:06 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 17:30:06 +0100 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: <277B3D4F-5DA1-4DFA-8666-729E9382D128@gmail.com> References: <277B3D4F-5DA1-4DFA-8666-729E9382D128@gmail.com> Message-ID: Hi, On Thu, Jun 30, 2011 at 5:03 PM, Pierre GM wrote: > > On Jun 30, 2011, at 5:38 PM, Matthew Brett wrote: > >> Hi, >> >> On Thu, Jun 30, 2011 at 2:58 PM, Pierre GM wrote: >>> >>> On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote: >>>> ############################################### >>>> A alternative-NEP on masking and missing values >>>> ############################################### >>> >>> I like the idea of two different special values, np.NA for missing values, np.IGNORE for masked values. np.NA values in an array define what was implemented in numpy.ma as a 'hard mask' (where you can't unmask data), while np.IGNOREs correspond to the .mask in numpy.ma. Looks fairly non ambiguous that way. >>> >>> >>>> ************** >>>> Initialization >>>> ************** >>>> >>>> First, missing values can be set and be displayed as ``np.NA, NA``:: >>>> >>>>>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]') >>>> ? ?array([1., 2., NA, 7.], dtype='NA[>>> >>>> As the initialization is not ambiguous, this can be written without the NA >>>> dtype:: >>>> >>>>>>> np.array([1.0, 2.0, np.NA, 7.0]) >>>> ? ?array([1., 2., NA, 7.], dtype='NA[>>> >>>> Masked values can be set and be displayed as ``np.MASKED, MASKED``:: >>>> >>>>>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True) >>>> ? ?array([1., 2., MASKED, 7.], masked=True) >>>> >>>> As the initialization is not ambiguous, this can be written without >>>> ``masked=True``:: >>>> >>>>>>> np.array([1.0, 2.0, np.MASKED, 7.0]) >>>> ? ?array([1., 2., MASKED, 7.], masked=True) >>> >>> I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here. >> >> If I put np.MASKED (I agree I prefer np.IGNORE) in the init, then >> obviously I mean it should be masked, so the 'masked=True' here is >> completely redundant, yes, I agree. ?And in fact: >> >> np.array([1.0, 2.0, np.MASKED, 7.0], masked=False) >> >> should raise an error. ?On the other hand, if I make a normal array: >> >> arr = np.array([1.0, 2.0, 7.0]) >> >> and then do this: >> >> arr.visible[2] = False >> >> then either I should raise an error (it's not a masked array), or, >> more magically, construct a mask on the fly. ? This somewhat breaks >> expectations though, because you might just have made a largish mask >> array without having any clue that that had happened. > > Well, I'd expect an error to be raised when assigning a NA if the initial array is not NA friendly. The 'magical' creation of a mask would be nice, but is probably too magic and best left alone. 
I agree :) >>> >>>> >>>> Direct assignnent in the masked case is magic and confusing, and so happens only >>>> via the mask:: >>>> >>>>>>> masked_array = np.array([1.0, 2.0, 7.0], masked=True) >>>>>>> masked_arr[2] = np.NA >>>> ? ?TypeError('dtype does not support NA') >>>>>>> masked_arr[2] = np.MASKED >>>> ? ?TypeError('float() argument must be a string or a number') >>>>>>> masked_arr.visible[2] = False >>>>>>> masked_arr >>>> ? ?array([1., 2., MASKED], masked=True) >>> >>> What about the reverse case ? When you assign a regular value to a np.NA/np.IGNORE item ? >> >> Well, for the np.NA case, this is straightforward: >> >> na_arr[2] = 3 >> >> It's just assignment. For ``masked_array[2] = 3`` - I don't know, I >> guess whatever we are used to. ?What do you think? > > Ahah, that depends. > With a = np.array([1., np.NA, 3.]), then a[1]=2. should raise an error, as Mark suggests: you can't "unmask" a missing value, you need to create a view of the initial array then "unmask". It's the equivalent of a hard mask. In this alterNEP, the NAs and the masked values are completely different. So, if you do this: a = np.array([1., np.NA, 3.]) then you've unambiguously asked for an array that can handle floats and NAs, and that will be the NA[ With a = np.array([1., np.IGNORE, 3.]), then a[1]=2. should give np.array([1.,2.,3.]) and a.mask=[False,False,False]. That's a soft mask. Sounds reasonable to me... Cheers, Matthew From matthew.brett at gmail.com Thu Jun 30 12:42:38 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 17:42:38 +0100 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe wrote: > On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman > wrote: >> >>> ? ? ?Clearly there are some overlaps between what masked arrays are >>> ? ? ?trying to achieve and what Rs NA mechanisms are trying to achieve. >>> ? ? ??Are they really similar enough that they should function using >>> ? ? ?the same API? >>> >>> Yes. >>> >>> ? ? ?And if so, won't that be confusing? >>> >>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are >>> already >>> confusing. >> >> As one who's been silently following (most of) this thread, and a heavy R >> and numpy user, perhaps I should chime in briefly here with a use case. I >> more-or-less always work with partially masked data, like Matthew, but not >> numpy masked arrays because the memory overhead is prohibitive. And, sad to >> say, my experiments don't always go perfectly. I therefore have arrays in >> which there is /both/ (1) data that is simply missing (np.NA?)--it never had >> a value and never will--as well as simultaneously (2) data that that is >> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask >> different portions for different purposes/analyses. I consider these two >> separate, completely independent issues and I unfortunately currently have >> to kluge a lot to handle this. >> >> Concretely, consider a list of 100,000 observations (rows), with 12 >> measures per observation-row (a 100,000 x 12 array). Every now and then, >> sprinkled throughout this array, I have missing values (someone didn't >> answer a question, or a computer failed to record a response, or whatever). 
>> For some analyses I want to mask the whole row (e.g., complete-case >> analysis), leaving me with array entries that should be tagged with all 4 >> possible labels: >> >> 1) not masked, not missing >> 2) masked, not missing >> 3) not masked, missing >> 4) masked, missing >> >> Obviously #4 is "overkill" ... but only until I want to unmask that row. >> At that point, I need to be sure that missing values remain missing when >> unmasked. Can a single API really handle this? > > The single API does support a masked array with an NA dtype, and the > behavior in this case will be that the value is considered NA if either it > is masked or the value is the NA bit pattern. So you could add a mask to an > array with an NA dtype to temporarily treat the data as if more values were > missing. Right - but I think the separated API is cleaner and easier to explain. Do you disagree? > One important reason I'm doing it this way is so that each NumPy algorithm > and any 3rd party code only needs to be updated once to support both forms > of missing data. Could you explain what you mean? Maybe a couple of examples? Whatever API results, it will surely be with us for a long time, and so it would be good to make sure we have the right one even if it costs a bit more to update current code. Cheers, Matthew From xscript at gmx.net Thu Jun 30 12:54:27 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Thu, 30 Jun 2011 18:54:27 +0200 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: (Mark Wiebe's message of "Thu, 30 Jun 2011 10:06:24 -0500") References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: <87vcvntg3g.fsf@ginnungagap.bsc.es> Mark Wiebe writes: > Why is one "magic" and the other "real"? All of this is already > sitting on 100 layers of abstraction above electrons and atoms. If > we're talking about "real," maybe we should be programming in machine > code or using breadboards with individual transistors. M-x butterfly RET http://xkcd.com/378/ -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From tkgamble at windstream.net Thu Jun 30 13:32:22 2011 From: tkgamble at windstream.net (Thomas K Gamble) Date: Thu, 30 Jun 2011 11:32:22 -0600 Subject: [Numpy-discussion] broacasting question Message-ID: <201106301132.22357.tkgamble@windstream.net> I'm trying to convert some IDL code to python/numpy and i'm having some trouble understanding the rules for boradcasting during some operations. example: given the following arrays: a = array((2048,3577), dtype=float) b = array((256,25088), dtype=float) c = array((2048,3136), dtype=float) d = array((2048,3136), dtype=float) do: a = b * c + d In IDL, the computation is done without complaint and all array sizes are preserved. In ptyhon I get a value error concerning broadcasting. I can force it to work by taking slices, but the resulting size would be a = (256x3136) rather than (2048x3577). I admit that I don't understand IDL (or python to be honest) well enough to know how it handles this to be able to replicate the result properly. Does it only operate on the smallest dimensions ignoring the larger indices leaving their values unchanged? Can someone explain this to me? -- Thomas K. 
Gamble tkgamble at windstream.net From shish at keba.be Thu Jun 30 13:36:15 2011 From: shish at keba.be (Olivier Delalleau) Date: Thu, 30 Jun 2011 13:36:15 -0400 Subject: [Numpy-discussion] broacasting question In-Reply-To: <201106301132.22357.tkgamble@windstream.net> References: <201106301132.22357.tkgamble@windstream.net> Message-ID: 2011/6/30 Thomas K Gamble > I'm trying to convert some IDL code to python/numpy and i'm having some > trouble understanding the rules for boradcasting during some operations. > example: > > given the following arrays: > a = array((2048,3577), dtype=float) > b = array((256,25088), dtype=float) > c = array((2048,3136), dtype=float) > d = array((2048,3136), dtype=float) > > do: > a = b * c + d > > In IDL, the computation is done without complaint and all array sizes are > preserved. In ptyhon I get a value error concerning broadcasting. I can > force it to work by taking slices, but the resulting size would be a = > (256x3136) rather than (2048x3577). I admit that I don't understand IDL > (or > python to be honest) well enough to know how it handles this to be able to > replicate the result properly. Does it only operate on the smallest > dimensions ignoring the larger indices leaving their values unchanged? Can > someone explain this to me? > > -- > Thomas K. Gamble tkgamble at windstream.net > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > It should be all explained here: http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html -=- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From xscript at gmx.net Thu Jun 30 13:46:27 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Thu, 30 Jun 2011 19:46:27 +0200 Subject: [Numpy-discussion] missing data: semantics Message-ID: <87oc1fp5zg.fsf@ginnungagap.bsc.es> Ok, I think it's time to step back and reformulate the problem by completely ignoring the implementation. Here we have 2 "generic" concepts (i.e., applicable to R), plus another extra concept that is exclusive to numpy: * Assigning np.NA to an array, cannot be undone unless through explicit assignment (i.e., assigning a new arbitrary value, or saving a copy of the original array before assigning np.NA). * np.NA values propagate by default, unless ufuncs have the "skipna = True" argument (or the other way around, it doesn't really matter to this discussion). In order to avoid passing the argument on each ufunc, we either have some per-array variable for the default "skipna" value (undesirable) or we can make a trivial ndarray subclass that will set the "skipna" argument on all ufuncs through the "_ufunc_wrapper_" mechanism. Now, numpy has the concept of views, which adds some more goodies to the list of concepts: * With views, two arrays can share the same physical data, so that assignments to any of them will be seen by others (including NA values). The creation of a view is explicitly stated by the user, so its behaviour should not be perceived as odd (after all, you asked for a view). The good thing is that with views you can avoid costly array copies if you're careful when writing into these views. Now, you can add a new concept: local/temporal/transient missing data. We can take an existing array and create a view with the new argument "transientna = True". 
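Something very close to this can already be written with today's numpy.ma, which may help make the proposal concrete (only existing calls below; "transientna" itself remains hypothetical):

    >>> import numpy as np
    >>> arr = np.array([1.0, 2.0, 3.0])
    >>> view = np.ma.array(arr, copy=False)   # masked view sharing arr's memory
    >>> view[1] = np.ma.masked                # "transient NA": only the mask changes
    >>> arr[1]                                # 2.0 -- the underlying data is untouched
    >>> view[0] = 99.0                        # ordinary assignments still write through
    >>> arr[0]                                # 99.0 -- visible from the original array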
Here, both the view and the "transientna = True" are explicitly stated by the user, so it is assumed that she already knows what this is all about. The difference with a regular view is that you also explicitly asked for local/temporal/transient NA values. * Assigning np.NA to an array view with "transientna = True" will *not* be seen by any of the other views (nor the "original" array), but anything else will still work "as usual". After all, this is what *you* asked for when using the "transientna = True" argument. To conclude, say that others *must not* care about whether the arrays they're working with have transient NA values. This way, I can create a view with transient NAs, set to NA some uninteresting data, and pass it to a routine written by someone else that sets to NA elements that, for example, are beyond certain threshold from the mean of the elements. This would be equivalent to storing a copy of the original array before passing it to this 3rd party function, only that "transientna", just as views, provide some handy shortcuts to avoid copies. My main point here is that views and local/temporal/transient NAs are all *explicitly* requested, so that its behaviour should not appear as something unexpected. Is there an agreement on this? Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From njs at pobox.com Thu Jun 30 13:51:08 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 30 Jun 2011 10:51:08 -0700 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett wrote: > In the interest of making the discussion as concrete as possible, here > is my draft of an alternative proposal for NAs and masking, based on > Nathaniel's comments. ?Writing it, it seemed to me that Nathaniel is > right, that the ideas become much clearer when the NA idea and the > MASK idea are separate. ? Please do pitch in for things I may have > missed or misunderstood: [...] Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk". One thing I wonder about the design is whether having an np.MASKED/np.IGNORE value at all helps or hurts. (Occam tells us never to multiply entities without necessity! And it's a bit of an odd fit to the masking concept, since the whole idea is that masking is a property of the array, not the individual datums.) Currently, I see the following uses for it: -- As a return value when someone tries to scalar-index a masked value -- As a placeholder to specify masked values when creating an array from a list (but not when assigning to an array later) -- As a return value when using propmask=True -- As something to display when printing a masked array Another way of doing things would be: -- Scalar-indexing a masked value returns an error, like trying to index past the end of an array. (Slicing etc. would still return a new masked array.) 
-- Having some sort of placeholder does seem nice, but I'm not sure how often you need to type out a masked array. And I notice that numpy.ma does support this (like so: ma.array([1, ma.masked, 3])) but the examples in the docs never use it. The replacement idiom would be something like: my_data = np.array([1, 999, 3], masked=True); my_data.visible = (my_data != 999). So maybe just leave out the placeholder value, at least for version 1? -- I don't really see the logic for supporting 'propmask' at all. AFAICT no-one has ever even considered this as a useful feature for numpy.ma, never mind implemented it? -- When printing, the numpy.ma approach of using "--" seems much more readable than me than having "IGNORE" all over my screen. So overall, making these changes would let us simplify the design. But maybe propmask is really critical for some use case, or there's some good reason to want to scalar-index missing values without getting an error? -- Nathaniel From matthew.brett at gmail.com Thu Jun 30 13:51:42 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 18:51:42 +0100 Subject: [Numpy-discussion] missing data: semantics In-Reply-To: <87oc1fp5zg.fsf@ginnungagap.bsc.es> References: <87oc1fp5zg.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Thu, Jun 30, 2011 at 6:46 PM, Llu?s wrote: > Ok, I think it's time to step back and reformulate the problem by > completely ignoring the implementation. > > Here we have 2 "generic" concepts (i.e., applicable to R), plus another > extra concept that is exclusive to numpy: > > * Assigning np.NA to an array, cannot be undone unless through explicit > ?assignment (i.e., assigning a new arbitrary value, or saving a copy of > ?the original array before assigning np.NA). > > * np.NA values propagate by default, unless ufuncs have the "skipna = > ?True" argument (or the other way around, it doesn't really matter to > ?this discussion). In order to avoid passing the argument on each > ?ufunc, we either have some per-array variable for the default "skipna" > ?value (undesirable) or we can make a trivial ndarray subclass that > ?will set the "skipna" argument on all ufuncs through the > ?"_ufunc_wrapper_" mechanism. > > > > Now, numpy has the concept of views, which adds some more goodies to the > list of concepts: > > * With views, two arrays can share the same physical data, so that > ?assignments to any of them will be seen by others (including NA > ?values). > > The creation of a view is explicitly stated by the user, so its > behaviour should not be perceived as odd (after all, you asked for a > view). > > The good thing is that with views you can avoid costly array copies if > you're careful when writing into these views. > > > > Now, you can add a new concept: local/temporal/transient missing data. > > We can take an existing array and create a view with the new argument > "transientna = True". > > Here, both the view and the "transientna = True" are explicitly stated > by the user, so it is assumed that she already knows what this is all > about. > > The difference with a regular view is that you also explicitly asked for > local/temporal/transient NA values. > > * Assigning np.NA to an array view with "transientna = True" will > ?*not* be seen by any of the other views (nor the "original" array), > ?but anything else will still work "as usual". > > After all, this is what *you* asked for when using the "transientna = > True" argument. 
> > > > To conclude, say that others *must not* care about whether the arrays > they're working with have transient NA values. This way, I can create a > view with transient NAs, set to NA some uninteresting data, and pass it > to a routine written by someone else that sets to NA elements that, for > example, are beyond certain threshold from the mean of the elements. > > This would be equivalent to storing a copy of the original array before > passing it to this 3rd party function, only that "transientna", just as > views, provide some handy shortcuts to avoid copies. > > > My main point here is that views and local/temporal/transient NAs are > all *explicitly* requested, so that its behaviour should not appear as > something unexpected. > > Is there an agreement on this? Absolutely, if by 'transientna' you mean 'masked'. The discussion is whether the NA API should be the same as the masking API. The thing you are describing is what masking is for, and what it's always been for, as far as I can see. We're arguing that to call this 'transientna' instead of 'masked' confuses two concepts that are different, to no good purpose. Best, Matthew From matthew.brett at gmail.com Thu Jun 30 13:58:59 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 18:58:59 +0100 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: Hi, On Thu, Jun 30, 2011 at 6:51 PM, Nathaniel Smith wrote: > On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett wrote: >> In the interest of making the discussion as concrete as possible, here >> is my draft of an alternative proposal for NAs and masking, based on >> Nathaniel's comments. ?Writing it, it seemed to me that Nathaniel is >> right, that the ideas become much clearer when the NA idea and the >> MASK idea are separate. ? Please do pitch in for things I may have >> missed or misunderstood: > [...] > > Thanks for writing this up! I stuck it up as a gist so we can edit it > more easily: > ?https://gist.github.com/1056379/ > This is your initial version: > ?https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 > And I made a few changes: > ?https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 > Specifically, I added a rationale section, changed np.MASKED to > np.IGNORE (as per comments in this thread), and added a vowel to > "propmsk". Thanks for doing that. > One thing I wonder about the design is whether having an > np.MASKED/np.IGNORE value at all helps or hurts. (Occam tells us never > to multiply entities without necessity! And it's a bit of an odd fit > to the masking concept, since the whole idea is that masking is a > property of the array, not the individual datums.) > > Currently, I see the following uses for it: > ?-- As a return value when someone tries to scalar-index a masked value > ?-- As a placeholder to specify masked values when creating an array > from a list (but not when assigning to an array later) > ?-- As a return value when using propmask=True > ?-- As something to display when printing a masked array > > Another way of doing things would be: > ?-- Scalar-indexing a masked value returns an error, like trying to > index past the end of an array. (Slicing etc. would still return a new > masked array.) > ?-- Having some sort of placeholder does seem nice, but I'm not sure > how often you need to type out a masked array. 
And I notice that > numpy.ma does support this (like so: ma.array([1, ma.masked, 3])) but > the examples in the docs never use it. The replacement idiom would be > something like: my_data = np.array([1, 999, 3], masked=True); > my_data.visible = (my_data != 999). So maybe just leave out the > placeholder value, at least for version 1? > ?-- I don't really see the logic for supporting 'propmask' at all. > AFAICT no-one has ever even considered this as a useful feature for > numpy.ma, never mind implemented it? > ?-- When printing, the numpy.ma approach of using "--" seems much > more readable than me than having "IGNORE" all over my screen. > > So overall, making these changes would let us simplify the design. But > maybe propmask is really critical for some use case, or there's some > good reason to want to scalar-index missing values without getting an > error? I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs? See you, Matthew From derek at astro.physik.uni-goettingen.de Thu Jun 30 14:11:45 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Thu, 30 Jun 2011 20:11:45 +0200 Subject: [Numpy-discussion] broacasting question In-Reply-To: <201106301132.22357.tkgamble@windstream.net> References: <201106301132.22357.tkgamble@windstream.net> Message-ID: <72D0EE64-18EA-422B-A0B2-856084D6A8ED@astro.physik.uni-goettingen.de> On 30.06.2011, at 7:32PM, Thomas K Gamble wrote: > I'm trying to convert some IDL code to python/numpy and i'm having some > trouble understanding the rules for boradcasting during some operations. > example: > > given the following arrays: > a = array((2048,3577), dtype=float) > b = array((256,25088), dtype=float) > c = array((2048,3136), dtype=float) > d = array((2048,3136), dtype=float) > > do: > a = b * c + d > > In IDL, the computation is done without complaint and all array sizes are > preserved. In ptyhon I get a value error concerning broadcasting. I can > force it to work by taking slices, but the resulting size would be a = > (256x3136) rather than (2048x3577). I admit that I don't understand IDL (or > python to be honest) well enough to know how it handles this to be able to > replicate the result properly. Does it only operate on the smallest > dimensions ignoring the larger indices leaving their values unchanged? Can > someone explain this to me? If IDL does such operations silently I'd probably rather be troubled about it... Assuming you actually meant something like "a = np.ndarray((2048,3577))" (because np.array((2048,3577), dtype=float) would simply be the 2-vector [ 2048. 3577.]), the shape of a indeed matches in no way the other ones. While b,c,d do have the same total size, thus something like b.reshape((2048,3136) * c + d will work, meaning the first 8 rows of b b[:8] would be concatenated to the first row of the output, and so on... Since the total size is still smaller than a, I could only venture something is done like np.add(b.reshape(2048,3136) * c, d, out=a[:,:3136]) But to say whether this is really the equivalent result to what IDL does, one would have to study the IDL manual in detail or directly compare the output (e.g. 
check what happens to the values in a[:,3136:]...) Cheers, Derek From charlesr.harris at gmail.com Thu Jun 30 14:11:52 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 30 Jun 2011 12:11:52 -0600 Subject: [Numpy-discussion] missing data: semantics In-Reply-To: <87oc1fp5zg.fsf@ginnungagap.bsc.es> References: <87oc1fp5zg.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 11:46 AM, Llu?s wrote: > Ok, I think it's time to step back and reformulate the problem by > completely ignoring the implementation. > > Here we have 2 "generic" concepts (i.e., applicable to R), plus another > extra concept that is exclusive to numpy: > > * Assigning np.NA to an array, cannot be undone unless through explicit > assignment (i.e., assigning a new arbitrary value, or saving a copy of > the original array before assigning np.NA). > > * np.NA values propagate by default, unless ufuncs have the "skipna = > True" argument (or the other way around, it doesn't really matter to > this discussion). In order to avoid passing the argument on each > ufunc, we either have some per-array variable for the default "skipna" > value (undesirable) or we can make a trivial ndarray subclass that > will set the "skipna" argument on all ufuncs through the > "_ufunc_wrapper_" mechanism. > > > > Now, numpy has the concept of views, which adds some more goodies to the > list of concepts: > > * With views, two arrays can share the same physical data, so that > assignments to any of them will be seen by others (including NA > values). > > The creation of a view is explicitly stated by the user, so its > behaviour should not be perceived as odd (after all, you asked for a > view). > > The good thing is that with views you can avoid costly array copies if > you're careful when writing into these views. > > > > Now, you can add a new concept: local/temporal/transient missing data. > > We can take an existing array and create a view with the new argument > "transientna = True". > > This is already there: x.view(masked=1), although the keyword transientna has appeal, not least because it avoids the word 'mask', which seems a source of endless confusion. Note that currently this is only supposed to work if the original array is unmasked. Here, both the view and the "transientna = True" are explicitly stated > by the user, so it is assumed that she already knows what this is all > about. > > The difference with a regular view is that you also explicitly asked for > local/temporal/transient NA values. > > * Assigning np.NA to an array view with "transientna = True" will > *not* be seen by any of the other views (nor the "original" array), > but anything else will still work "as usual". > > After all, this is what *you* asked for when using the "transientna = > True" argument. > > > > To conclude, say that others *must not* care about whether the arrays > they're working with have transient NA values. This way, I can create a > view with transient NAs, set to NA some uninteresting data, and pass it > to a routine written by someone else that sets to NA elements that, for > example, are beyond certain threshold from the mean of the elements. > > This would be equivalent to storing a copy of the original array before > passing it to this 3rd party function, only that "transientna", just as > views, provide some handy shortcuts to avoid copies. > > > My main point here is that views and local/temporal/transient NAs are > all *explicitly* requested, so that its behaviour should not appear as > something unexpected. 
> > Is there an agreement on this? > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jun 30 14:13:59 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 30 Jun 2011 12:13:59 -0600 Subject: [Numpy-discussion] missing data: semantics In-Reply-To: References: <87oc1fp5zg.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 11:51 AM, Matthew Brett wrote: > Hi, > > On Thu, Jun 30, 2011 at 6:46 PM, Llu?s wrote: > > Ok, I think it's time to step back and reformulate the problem by > > completely ignoring the implementation. > > > > Here we have 2 "generic" concepts (i.e., applicable to R), plus another > > extra concept that is exclusive to numpy: > > > > * Assigning np.NA to an array, cannot be undone unless through explicit > > assignment (i.e., assigning a new arbitrary value, or saving a copy of > > the original array before assigning np.NA). > > > > * np.NA values propagate by default, unless ufuncs have the "skipna = > > True" argument (or the other way around, it doesn't really matter to > > this discussion). In order to avoid passing the argument on each > > ufunc, we either have some per-array variable for the default "skipna" > > value (undesirable) or we can make a trivial ndarray subclass that > > will set the "skipna" argument on all ufuncs through the > > "_ufunc_wrapper_" mechanism. > > > > > > > > Now, numpy has the concept of views, which adds some more goodies to the > > list of concepts: > > > > * With views, two arrays can share the same physical data, so that > > assignments to any of them will be seen by others (including NA > > values). > > > > The creation of a view is explicitly stated by the user, so its > > behaviour should not be perceived as odd (after all, you asked for a > > view). > > > > The good thing is that with views you can avoid costly array copies if > > you're careful when writing into these views. > > > > > > > > Now, you can add a new concept: local/temporal/transient missing data. > > > > We can take an existing array and create a view with the new argument > > "transientna = True". > > > > Here, both the view and the "transientna = True" are explicitly stated > > by the user, so it is assumed that she already knows what this is all > > about. > > > > The difference with a regular view is that you also explicitly asked for > > local/temporal/transient NA values. > > > > * Assigning np.NA to an array view with "transientna = True" will > > *not* be seen by any of the other views (nor the "original" array), > > but anything else will still work "as usual". > > > > After all, this is what *you* asked for when using the "transientna = > > True" argument. > > > > > > > > To conclude, say that others *must not* care about whether the arrays > > they're working with have transient NA values. This way, I can create a > > view with transient NAs, set to NA some uninteresting data, and pass it > > to a routine written by someone else that sets to NA elements that, for > > example, are beyond certain threshold from the mean of the elements. > > > > This would be equivalent to storing a copy of the original array before > > passing it to this 3rd party function, only that "transientna", just as > > views, provide some handy shortcuts to avoid copies. > > > > > > My main point here is that views and local/temporal/transient NAs are > > all *explicitly* requested, so that its behaviour should not appear as > > something unexpected. 
> > > > Is there an agreement on this? > > Absolutely, if by 'transientna' you mean 'masked'. The discussion is > whether the NA API should be the same as the masking API. The thing > you are describing is what masking is for, and what it's always been > for, as far as I can see. We're arguing that to call this > 'transientna' instead of 'masked' confuses two concepts that are > different, to no good purpose. > > It's a hammer. If you want to hammer nails, fine, if you want hammer a bit of tubing flat, fine. It's a tool, the hammer concept if you will. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 14:17:50 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 13:17:50 -0500 Subject: [Numpy-discussion] review request: introductory datetime documentation Message-ID: https://github.com/numpy/numpy/pull/101 Thanks, Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From xscript at gmx.net Thu Jun 30 14:27:45 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Thu, 30 Jun 2011 20:27:45 +0200 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: (Matthew Brett's message of "Thu, 30 Jun 2011 18:58:59 +0100") References: Message-ID: <87r56bnpi6.fsf@ginnungagap.bsc.es> Matthew Brett writes: [...] > I'm afraid, like you, I'm a little lost in the world of masking, > because I only need the NAs. I was trying to see if I could come up > with an API that picked up some of the syntactic convenience of NAs, > without conflating NAs with IGNOREs. I guess we need some feedback > from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea > of what we've missed. @Mark, @Chuck, guys - what have we lost here by > separating the APIs? As I tried to convey on my other mail, separating both will force you to either: * Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data) or * Tell the other routine whether it should use np.NA or np.IGNORE *and* whether it should use "skipna" and/or "propmask". To me, that's the whole point about a unified API: * Avoid making array copies. * Do not add more arguments to *all* routines (to tell them which kind of missing data they should produce, and which kind of missing data they should ignore/propagate). Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From mwwiebe at gmail.com Thu Jun 30 14:31:38 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 13:31:38 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 11:42 AM, Matthew Brett wrote: > Hi, > > On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe wrote: > > On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman > > wrote: > >> > >>> Clearly there are some overlaps between what masked arrays are > >>> trying to achieve and what Rs NA mechanisms are trying to achieve. > >>> Are they really similar enough that they should function using > >>> the same API? > >>> > >>> Yes. > >>> > >>> And if so, won't that be confusing? > >>> > >>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are > >>> already > >>> confusing. 
> >> > >> As one who's been silently following (most of) this thread, and a heavy > R > >> and numpy user, perhaps I should chime in briefly here with a use case. > I > >> more-or-less always work with partially masked data, like Matthew, but > not > >> numpy masked arrays because the memory overhead is prohibitive. And, sad > to > >> say, my experiments don't always go perfectly. I therefore have arrays > in > >> which there is /both/ (1) data that is simply missing (np.NA?)--it never > had > >> a value and never will--as well as simultaneously (2) data that that is > >> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask > >> different portions for different purposes/analyses. I consider these two > >> separate, completely independent issues and I unfortunately currently > have > >> to kluge a lot to handle this. > >> > >> Concretely, consider a list of 100,000 observations (rows), with 12 > >> measures per observation-row (a 100,000 x 12 array). Every now and then, > >> sprinkled throughout this array, I have missing values (someone didn't > >> answer a question, or a computer failed to record a response, or > whatever). > >> For some analyses I want to mask the whole row (e.g., complete-case > >> analysis), leaving me with array entries that should be tagged with all > 4 > >> possible labels: > >> > >> 1) not masked, not missing > >> 2) masked, not missing > >> 3) not masked, missing > >> 4) masked, missing > >> > >> Obviously #4 is "overkill" ... but only until I want to unmask that row. > >> At that point, I need to be sure that missing values remain missing when > >> unmasked. Can a single API really handle this? > > > > The single API does support a masked array with an NA dtype, and the > > behavior in this case will be that the value is considered NA if either > it > > is masked or the value is the NA bit pattern. So you could add a mask to > an > > array with an NA dtype to temporarily treat the data as if more values > were > > missing. > > Right - but I think the separated API is cleaner and easier to > explain. Do you disagree? > Kind of, yeah. I think the important things to understand from the Python perspective are that there are two ways of doing missing values with NA that look exactly the same except for how you create the arrays. Since you know that the mask way takes more memory, and that's important for your application, you can decide to use the NA dtype without any additional depth. Understanding that one of them has a special signal for NA while the other uses masks in the background probably isn't even that important to understand to be able to use it. I bet lots of people who use R regularly couldn't come up with a correct explanation of how it works there. If someone doesn't understand masks, they can use their intuition based on the special signal idea without any difficulty. The idea that you can temporarily make some values NA without overwriting your data may not be intuitive at first glance, but I expect people will find it useful even if they don't fully understand the subtle details of the masking mechanism. > One important reason I'm doing it this way is so that each NumPy algorithm > > and any 3rd party code only needs to be updated once to support both > forms > > of missing data. > > Could you explain what you mean? Maybe a couple of examples? > Yeah, I've started adding some implementation notes to the NEP. First I need volunteers to review my current pull requests though. 
;) -Mark > > Whatever API results, it will surely be with us for a long time, and > so it would be good to make sure we have the right one even if it > costs a bit more to update current code. > > Cheers, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwwiebe at gmail.com Thu Jun 30 14:32:20 2011 From: mwwiebe at gmail.com (Mark Wiebe) Date: Thu, 30 Jun 2011 13:32:20 -0500 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <87vcvntg3g.fsf@ginnungagap.bsc.es> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <87vcvntg3g.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 11:54 AM, Llu?s wrote: > Mark Wiebe writes: > > Why is one "magic" and the other "real"? All of this is already > > sitting on 100 layers of abstraction above electrons and atoms. If > > we're talking about "real," maybe we should be programming in machine > > code or using breadboards with individual transistors. > > M-x butterfly RET > > http://xkcd.com/378/ Ok, I've run this, how long does it take to execute? -Mark > > > -- > "And it's much the same thing with knowledge, for whenever you learn > something new, the whole world becomes that much richer." > -- The Princess of Pure Reason, as told by Norton Juster in The Phantom > Tollbooth > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Thu Jun 30 14:49:27 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 30 Jun 2011 11:49:27 -0700 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: <87r56bnpi6.fsf@ginnungagap.bsc.es> References: <87r56bnpi6.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 11:27 AM, Llu?s wrote: > As I tried to convey on my other mail, separating both will force you to > either: > > * Make a copy of the array before passing it to another routine (because > ?the routine will assign np.NA but you still want the original data) To help me understand, do you have an example in mind of a routine that would do that? I can't think of any cases where I had some original data that some routine wanted to throw out and replace with NAs; it just seems... weird. Maybe I'm missing something though... (I can imagine that it would make sense for what we're calling a masked array, where you have some routine which computes which values should be ignored for a particular purpose. But if it only makes sense for masked arrays then you can just write your routine to work with masked arrays only, and it doesn't matter how similar the masking and missing APIs are.) -- Nathaniel From njs at pobox.com Thu Jun 30 14:53:40 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 30 Jun 2011 11:53:40 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0B9743.10608@hawaii.edu> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <4E0B9743.10608@hawaii.edu> Message-ID: On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing wrote: > In addition, for new code, the full-blown masked array module may not be > needed. 
?A convenience it adds, however, is the automatic masking of > invalid values: > > In [1]: np.ma.log(-1) > Out[1]: masked > > I'm sure this horrifies some, but there are times and places where it is > a genuine convenience, and preferable to having to use a separate > operation to replace nan or inf with NA or whatever it ends up being. Err, but what would this even get you? NA, NaN, and Inf basically all behave the same WRT floating point operations anyway, i.e., they all propagate? Is the idea that if ufunc's gain a skipna=True flag, you'd also like to be able to turn it into a skipna_and_nan_and_inf=True flag? -- Nathaniel From matthew.brett at gmail.com Thu Jun 30 15:03:23 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jun 2011 20:03:23 +0100 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: <87r56bnpi6.fsf@ginnungagap.bsc.es> References: <87r56bnpi6.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Thu, Jun 30, 2011 at 7:27 PM, Llu?s wrote: > Matthew Brett writes: > [...] >> I'm afraid, like you, I'm a little lost in the world of masking, >> because I only need the NAs. ?I was trying to see if I could come up >> with an API that picked up some of the syntactic convenience of NAs, >> without conflating NAs with IGNOREs. ? I guess we need some feedback >> from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea >> of what we've missed. ?@Mark, @Chuck, guys - what have we lost here by >> separating the APIs? > > As I tried to convey on my other mail, separating both will force you to > either: > > * Make a copy of the array before passing it to another routine (because > ?the routine will assign np.NA but you still want the original data) You have an array 'arr'. The array does support NAs, but it doesn't have a mask. You want to pass ``arr`` to another routine ``func``. You expect ``func`` to set NAs into the data but you don't want ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. You are saying the following: "with the fused API, I can make ``arr`` be a masked array, and pass it into ``func``, and know that, when func sets elements of arr to NA, it will only modify the mask and not the underlying data in ``arr``." It does seem to me this is a very obscure case. First, ``func`` is modifying the array but you want an unmodified array back. Second, you'll have to do some view trick to recover the not-NA case to arr, when it comes back. It seems to me, that what ``func`` should do, if it wants you to be able to unmask the NAs, is to make a masked array view of ``arr``, and return that. And indeed the simplicity of the separated API immediately makes that clear - in my view at least. Best, Matthew From xscript at gmx.net Thu Jun 30 15:06:21 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Thu, 30 Jun 2011 21:06:21 +0200 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: (Nathaniel Smith's message of "Thu, 30 Jun 2011 11:49:27 -0700") References: <87r56bnpi6.fsf@ginnungagap.bsc.es> Message-ID: <87wrg3m95e.fsf@ginnungagap.bsc.es> Nathaniel Smith writes: > On Thu, Jun 30, 2011 at 11:27 AM, Llu?s wrote: >> As I tried to convey on my other mail, separating both will force you to >> either: >> >> * Make a copy of the array before passing it to another routine (because >> ?the routine will assign np.NA but you still want the original data) > To help me understand, do you have an example in mind of a routine > that would do that? 
I can't think of any cases where I had some > original data that some routine wanted to throw out and replace with > NAs; it just seems... weird. Maybe I'm missing something though... Well, I had some silly example on another thread. A function that computes the mean of all non-NA values, and assigns NA to all cells that are beyond certain threshold of that mean value. > (I can imagine that it would make sense for what we're calling a > masked array, where you have some routine which computes which values > should be ignored for a particular purpose. But if it only makes sense > for masked arrays then you can just write your routine to work with > masked arrays only, and it doesn't matter how similar the masking and > missing APIs are.) The routine makes sense by itself as a beyond-mean detector. The routine must not care whether your NAs are transient or not (in your aNEP, whether you want it to assign np.NA or np.IGNORE, which must be indicated by the caller through yet another function argument). Note that callers will not only have to indicate which "type" of missing data the calle should use (np.NA or np.IGNORE), but they also have to indicate whether np.NAs must be ignored (i.e., skipna=bool), as well as np.IGNORE (i.e., propmask=bool). Of course it is doable, but adding 3 more arguments to *all* functions (including ufuncs, and higher-level functions) does not seem as desirable to me. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From efiring at hawaii.edu Thu Jun 30 15:27:24 2011 From: efiring at hawaii.edu (Eric Firing) Date: Thu, 30 Jun 2011 09:27:24 -1000 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <4E0B9743.10608@hawaii.edu> Message-ID: <4E0CCE1C.5010400@hawaii.edu> On 06/30/2011 08:53 AM, Nathaniel Smith wrote: > On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing wrote: >> In addition, for new code, the full-blown masked array module may not be >> needed. A convenience it adds, however, is the automatic masking of >> invalid values: >> >> In [1]: np.ma.log(-1) >> Out[1]: masked >> >> I'm sure this horrifies some, but there are times and places where it is >> a genuine convenience, and preferable to having to use a separate >> operation to replace nan or inf with NA or whatever it ends up being. > > Err, but what would this even get you? NA, NaN, and Inf basically all > behave the same WRT floating point operations anyway, i.e., they all > propagate? Not exactly. First, it depends on np.seterr; second, calculations on NaN can be very slow, so are better avoided entirely; third, if an array is passed to extension code, it is much nicer if that code only has one NA value to handle, instead of having to check for all possible "bad" values. > > Is the idea that if ufunc's gain a skipna=True flag, you'd also like > to be able to turn it into a skipna_and_nan_and_inf=True flag? No, it is to have a situation where skipna_and_nan_and_inf would not be needed, because an operation generating a nan or inf would turn those values into NA or IGNORE or whatever right away. 
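For concreteness, the auto-masking described above is what numpy.ma's domained functions do today; np.errstate below is only there to silence the floating-point warnings, and masked_invalid covers results that already contain nan/inf. Existing API, nothing from the NEP:

import numpy as np

with np.errstate(invalid='ignore', divide='ignore'):
    print(repr(np.ma.log(-1.0)))           # masked
    print(np.ma.log([1.0, -1.0, 0.0]))     # [0.0 -- --]
print(np.ma.masked_invalid([1.0, np.nan, np.inf]))   # [1.0 -- --]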
Eric > > -- Nathaniel > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From njs at pobox.com Thu Jun 30 15:52:47 2011 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 30 Jun 2011 12:52:47 -0700 Subject: [Numpy-discussion] missing data discussion round 2 In-Reply-To: <4E0CCE1C.5010400@hawaii.edu> References: <4E0A5875.8070803@creativetrax.com> <87oc1gg4ee.fsf@ginnungagap.bsc.es> <4E0B9743.10608@hawaii.edu> <4E0CCE1C.5010400@hawaii.edu> Message-ID: On Thu, Jun 30, 2011 at 12:27 PM, Eric Firing wrote: > On 06/30/2011 08:53 AM, Nathaniel Smith wrote: >> On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing ?wrote: >>> In addition, for new code, the full-blown masked array module may not be >>> needed. ?A convenience it adds, however, is the automatic masking of >>> invalid values: >>> >>> In [1]: np.ma.log(-1) >>> Out[1]: masked >>> >>> I'm sure this horrifies some, but there are times and places where it is >>> a genuine convenience, and preferable to having to use a separate >>> operation to replace nan or inf with NA or whatever it ends up being. >> >> Err, but what would this even get you? NA, NaN, and Inf basically all >> behave the same WRT floating point operations anyway, i.e., they all >> propagate? > > Not exactly. First, it depends on np.seterr; IIUC, you're proposing to make this conversion depend on np.seterr too, though, right? > second, calculations on NaN > can be very slow, so are better avoided entirely They're slow because inside the processor they require a branch and a separate code path (which doesn't get a lot of transistors allocated to it). In any of the NA proposals we're talking about, handling an NA would require a software branch and a separate code path (which is in ordinary software, now, so it doesn't get any special transistors allocated to it...). I don't think masking support is likely to give you a speedup over the processor's NaN handling. And if it did, that would mean that we speed up FP operations in general by checking for NaN in software, so then we should do that everywhere anyway instead of making it an NA-specific feature... > third, if an array is > passed to extension code, it is much nicer if that code only has one NA > value to handle, instead of having to check for all possible "bad" values. I'm pretty sure that Mark's proposal does not work this way -- he's saying that the NA-checking code in numpy could optionally check for all these different "bad" values and handle them the same in ufuncs, not that we would check the outputs of all FP operations for "bad" values and then replace them by NA. So your extension code would still have the same problem. Sorry :-( -- Nathaniel From xscript at gmx.net Thu Jun 30 16:01:07 2011 From: xscript at gmx.net (=?utf-8?Q?Llu=C3=ADs?=) Date: Thu, 30 Jun 2011 22:01:07 +0200 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: (Matthew Brett's message of "Thu, 30 Jun 2011 20:03:23 +0100") References: <87r56bnpi6.fsf@ginnungagap.bsc.es> Message-ID: <87zkkzjdh8.fsf@ginnungagap.bsc.es> Matthew Brett writes: > Hi, > On Thu, Jun 30, 2011 at 7:27 PM, Llu?s wrote: >> Matthew Brett writes: >> [...] >>> I'm afraid, like you, I'm a little lost in the world of masking, >>> because I only need the NAs. ?I was trying to see if I could come up >>> with an API that picked up some of the syntactic convenience of NAs, >>> without conflating NAs with IGNOREs. ? 
I guess we need some feedback >>> from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea >>> of what we've missed. ?@Mark, @Chuck, guys - what have we lost here by >>> separating the APIs? >> >> As I tried to convey on my other mail, separating both will force you to >> either: >> >> * Make a copy of the array before passing it to another routine (because >> ?the routine will assign np.NA but you still want the original data) > You have an array 'arr'. The array does support NAs, but it doesn't > have a mask. You want to pass ``arr`` to another routine ``func``. > You expect ``func`` to set NAs into the data but you don't want > ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. > You are saying the following: > "with the fused API, I can make ``arr`` be a masked array, and pass it > into ``func``, and know that, when func sets elements of arr to NA, it > will only modify the mask and not the underlying data in ``arr``." Yes. > It does seem to me this is a very obscure case. First, ``func`` is > modifying the array but you want an unmodified array back. Second, > you'll have to do some view trick to recover the not-NA case to arr, > when it comes back. I know, the example is just silly and convoluted. > It seems to me, that what ``func`` should do, if it wants you to be > able to unmask the NAs, is to make a masked array view of ``arr``, and > return that. And indeed the simplicity of the separated API > immediately makes that clear - in my view at least. I agree on this example. My only concern is on the API's ability to foresee as most future use-cases as possible, without impacting performance. 1) On one hand, we have that functions must be specially crafted to handle transient NA (i.e., create a masked array to store the output, which will be possibly optional, so it needs another function argument). And not everybody will foresee such usage, resulting in an inconsistent API w.r.t. np.NA vs np.IGNORE. We could alternatively see this as a knob to say, whenever you store np.NA, please use np.IGNORE. It all needs collaboration from the callee. 2) On the other hand, we have that it can all be controlled by the caller, who is really the only one that knows its needs. This, at the risk of confusing the user (I still believe the user should not be confused because the mask must be explicitly activated). If you're telling me "2 is not necessary because functions written as 1 are few and clearly identified", then I'll just say I don't know. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth From tkgamble at windstream.net Thu Jun 30 17:57:58 2011 From: tkgamble at windstream.net (Thomas K Gamble) Date: Thu, 30 Jun 2011 15:57:58 -0600 Subject: [Numpy-discussion] broacasting question In-Reply-To: <72D0EE64-18EA-422B-A0B2-856084D6A8ED@astro.physik.uni-goettingen.de> References: <201106301132.22357.tkgamble@windstream.net> <72D0EE64-18EA-422B-A0B2-856084D6A8ED@astro.physik.uni-goettingen.de> Message-ID: <201106301557.58611.tkgamble@windstream.net> > On 30.06.2011, at 7:32PM, Thomas K Gamble wrote: > > I'm trying to convert some IDL code to python/numpy and i'm having some > > trouble understanding the rules for boradcasting during some operations. 
> > example: > > > > given the following arrays: > > a = array((2048,3577), dtype=float) > > b = array((256,25088), dtype=float) > > c = array((2048,3136), dtype=float) > > d = array((2048,3136), dtype=float) > > > > do: > > a = b * c + d > > > > In IDL, the computation is done without complaint and all array sizes are > > preserved. In ptyhon I get a value error concerning broadcasting. I can > > force it to work by taking slices, but the resulting size would be a = > > (256x3136) rather than (2048x3577). I admit that I don't understand IDL > > (or python to be honest) well enough to know how it handles this to be > > able to replicate the result properly. Does it only operate on the > > smallest dimensions ignoring the larger indices leaving their values > > unchanged? Can someone explain this to me? > > If IDL does such operations silently I'd probably rather be troubled about > it... Assuming you actually meant something like "a = > np.ndarray((2048,3577))" (because np.array((2048,3577), dtype=float) would > simply be the 2-vector [ 2048. 3577.]), the shape of a indeed matches in > no way the other ones. While b,c,d do have the same total size, thus > something like > > b.reshape((2048,3136) * c + d > > will work, meaning the first 8 rows of b b[:8] would be concatenated to the > first row of the output, and so on... Since the total size is still > smaller than a, I could only venture something is done like > > np.add(b.reshape(2048,3136) * c, d, out=a[:,:3136]) > > But to say whether this is really the equivalent result to what IDL does, > one would have to study the IDL manual in detail or directly compare the > output (e.g. check what happens to the values in a[:,3136:]...) > > Cheers, > Derek Your post gave me the cluse I needed. I had my shapes slightly off in the example I gave, but if I try: a = reshape(b.flatten('F') * c.flatten('F') + d.flatten('F'), b.shape, order='F') I get a result in line with the IDL result. Another example with different total size arrays: b = np.ndarray((2048,3577), dtype=float) c = np.ndarray((256,25088), dtype=float) a= reshape(b.flatten('F')[:size(c)]/c.flatten('F'), c.shape, order='F') This also gives a result like that of IDL. Thanks for the help. -- Thomas K. Gamble tkgamble at windstream.net Receive my instruction, and not silver; and knowledge rather than choice gold. For wisdom is better than rubies; and all the things that may be desired are not to be compared to it. (Proverbs 8:10,11) From derek at astro.physik.uni-goettingen.de Thu Jun 30 19:31:31 2011 From: derek at astro.physik.uni-goettingen.de (Derek Homeier) Date: Fri, 1 Jul 2011 01:31:31 +0200 Subject: [Numpy-discussion] broacasting question In-Reply-To: <201106301557.58611.tkgamble@windstream.net> References: <201106301132.22357.tkgamble@windstream.net> <72D0EE64-18EA-422B-A0B2-856084D6A8ED@astro.physik.uni-goettingen.de> <201106301557.58611.tkgamble@windstream.net> Message-ID: <83163AE5-E45B-4369-A848-F4F62329E98E@astro.physik.uni-goettingen.de> On 30.06.2011, at 11:57PM, Thomas K Gamble wrote: >> np.add(b.reshape(2048,3136) * c, d, out=a[:,:3136]) >> >> But to say whether this is really the equivalent result to what IDL does, >> one would have to study the IDL manual in detail or directly compare the >> output (e.g. check what happens to the values in a[:,3136:]...) >> >> Cheers, >> Derek > > Your post gave me the cluse I needed. 
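As a quick sanity check of the recipe quoted here, the same idea on small made-up shapes (b is 2x6, c and d are 4x3 -- different shapes, equal total size, standing in for the Fortran-ordered IDL arrays; a1 and a2 are illustration names, not from the thread):

import numpy as np

b = np.arange(12, dtype=float).reshape(2, 6)
c = np.arange(1, 13, dtype=float).reshape(4, 3)
d = np.ones((4, 3))

# operate on the column-major (Fortran-ordered) raveled data, reshape back
a1 = np.reshape(b.flatten('F') * c.flatten('F') + d.flatten('F'),
                c.shape, order='F')

# the transpose/reshape variant mentioned below avoids the explicit flattens
a2 = b.T.reshape(c.T.shape).T * c + d

print(np.array_equal(a1, a2))   # True -- both read b column by column
print(a1.shape)                 # (4, 3)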
> > I had my shapes slightly off in the example I gave, but if I try: > > a = reshape(b.flatten('F') * c.flatten('F') + d.flatten('F'), b.shape, order='F') > > I get a result in line with the IDL result. > > Another example with different total size arrays: > > b = np.ndarray((2048,3577), dtype=float) > c = np.ndarray((256,25088), dtype=float) > > a= reshape(b.flatten('F')[:size(c)]/c.flatten('F'), c.shape, order='F') > > This also gives a result like that of IDL. Right, I forgot to point out that there are at least 2 ways to bring the arrays into compatible shapes (that's the reason broadcasting does not work here, because numpy only does automatic broadcasting if there is an unambiguous way to do so). So the IDL arrays being Fortran-ordered is the essential bit of information here. Just two remarks: I. Assigning a = reshape(b.flatten('F')[:size(c)]/c.flatten('F'), c.shape, order='F') as above will create a new array of shape c.shape - if you wanted to put your results into an existing array of shape(2048,3577), you'd still have to explicitly say a[:,:3136] = ... II. The flatten() operations and the assignment above all create full copies of the arrays, thus the np.add ufunc above together with simple reshape operations might improve performance somewhat - however keeping the Fortran order also requires some costly transpositions, as for your last example a = np.divide(b.T[:3136].reshape(c.T.shape).T, c, out=a) so YMMV... Cheers, Derek From matthew.brett at gmail.com Thu Jun 30 20:02:15 2011 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 1 Jul 2011 01:02:15 +0100 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: <87zkkzjdh8.fsf@ginnungagap.bsc.es> References: <87r56bnpi6.fsf@ginnungagap.bsc.es> <87zkkzjdh8.fsf@ginnungagap.bsc.es> Message-ID: Hi, On Thu, Jun 30, 2011 at 9:01 PM, Llu?s wrote: > Matthew Brett writes: > >> Hi, >> On Thu, Jun 30, 2011 at 7:27 PM, Llu?s wrote: >>> Matthew Brett writes: >>> [...] >>>> I'm afraid, like you, I'm a little lost in the world of masking, >>>> because I only need the NAs. ?I was trying to see if I could come up >>>> with an API that picked up some of the syntactic convenience of NAs, >>>> without conflating NAs with IGNOREs. ? I guess we need some feedback >>>> from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea >>>> of what we've missed. ?@Mark, @Chuck, guys - what have we lost here by >>>> separating the APIs? >>> >>> As I tried to convey on my other mail, separating both will force you to >>> either: >>> >>> * Make a copy of the array before passing it to another routine (because >>> ?the routine will assign np.NA but you still want the original data) > >> You have an array 'arr'. ? The array does support NAs, but it doesn't >> have a mask. ?You want to pass ``arr`` to another routine ``func``. >> You expect ``func`` to set NAs into the data but you don't want >> ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. >> You are saying the following: > >> "with the fused API, I can make ``arr`` be a masked array, and pass it >> into ``func``, and know that, when func sets elements of arr to NA, it >> will only modify the mask and not the underlying data in ``arr``." > > Yes. > > >> It does seem to me this is a very obscure case. ?First, ``func`` is >> modifying the array but you want an unmodified array back. ?Second, >> you'll have to do some view trick to recover the not-NA case to arr, >> when it comes back. > > I know, the example is just silly and convoluted. 
> > >> It seems to me, that what ``func`` should do, if it wants you to be >> able to unmask the NAs, is to make a masked array view of ``arr``, and >> return that. ? And indeed the simplicity of the separated API >> immediately makes that clear - in my view at least. > > I agree on this example. My only concern is on the API's ability to > foresee as most future use-cases as possible, without impacting > performance. But, of course, there's a great danger in trying to cover every possible use-case. My argument is that the kind of cases that you are describe are - I believe - very rare and are even a little difficult to make up. Is that fair? To my mind, the separate NA and IGNORE API is easier to understand and explain. If that isn't true, please do say, and say why - because that point is key. If it is true that the separate API is clearer, then the benefit in terms of power and extensibility has to be large, in order to go for the fused API. Cheers, Matthew From strang at nmr.mgh.harvard.edu Thu Jun 30 20:38:04 2011 From: strang at nmr.mgh.harvard.edu (Gary Strangman) Date: Thu, 30 Jun 2011 20:38:04 -0400 (EDT) Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: <87r56bnpi6.fsf@ginnungagap.bsc.es> <87zkkzjdh8.fsf@ginnungagap.bsc.es> Message-ID: >>> It seems to me, that what ``func`` should do, if it wants you to be >>> able to unmask the NAs, is to make a masked array view of ``arr``, and >>> return that. ? And indeed the simplicity of the separated API >>> immediately makes that clear - in my view at least. >> >> I agree on this example. My only concern is on the API's ability to >> foresee as most future use-cases as possible, without impacting >> performance. > > But, of course, there's a great danger in trying to cover every > possible use-case. > > My argument is that the kind of cases that you are describe are - I > believe - very rare and are even a little difficult to make up. Is > that fair? > > To my mind, the separate NA and IGNORE API is easier to understand and > explain. If that isn't true, please do say, and say why - because > that point is key. > > If it is true that the separate API is clearer, then the benefit in > terms of power and extensibility has to be large, in order to go for > the fused API. For what it's worth, I wholeheartedly agree with Matthew here. Being able to designate NA separately from IGNORE has tremendous conceptual clarity, at least for me. Not only are these are completely separate mental constructs in my head, but they even arise from completely different sources: NAs arise from my subjects whims, my experimental procedures, my research personnel, or bad equipment days, whereas IGNORE generally comes from me and my analysis or visualization needs. While I bet it's possible for an exceedingly clever person to fuse the two (I doubt my brain could pull that off), I fear that in the end I would have to go to the documentation every time in order to use either one. Thus, I agree that fusing into a single API needs to have a very large benefit. I admit I haven't followed all steps here, but I sense there is indeed numpy-coder-level benefit to fusing. However I, like Matthew (I believe), don't see appreciable benefits at the user level, /plus/ the risk of user confusion ... -best Gary The information in this e-mail is intended only for the person to whom it is addressed. 
If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail. From charlesr.harris at gmail.com Thu Jun 30 21:10:04 2011 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 30 Jun 2011 19:10:04 -0600 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: <87r56bnpi6.fsf@ginnungagap.bsc.es> <87zkkzjdh8.fsf@ginnungagap.bsc.es> Message-ID: On Thu, Jun 30, 2011 at 6:02 PM, Matthew Brett wrote: > Hi, > > On Thu, Jun 30, 2011 at 9:01 PM, Llu?s wrote: > > Matthew Brett writes: > > > >> Hi, > >> On Thu, Jun 30, 2011 at 7:27 PM, Llu?s wrote: > >>> Matthew Brett writes: > >>> [...] > >>>> I'm afraid, like you, I'm a little lost in the world of masking, > >>>> because I only need the NAs. I was trying to see if I could come up > >>>> with an API that picked up some of the syntactic convenience of NAs, > >>>> without conflating NAs with IGNOREs. I guess we need some feedback > >>>> from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea > >>>> of what we've missed. @Mark, @Chuck, guys - what have we lost here by > >>>> separating the APIs? > >>> > >>> As I tried to convey on my other mail, separating both will force you > to > >>> either: > >>> > >>> * Make a copy of the array before passing it to another routine > (because > >>> the routine will assign np.NA but you still want the original data) > > > >> You have an array 'arr'. The array does support NAs, but it doesn't > >> have a mask. You want to pass ``arr`` to another routine ``func``. > >> You expect ``func`` to set NAs into the data but you don't want > >> ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. > >> You are saying the following: > > > >> "with the fused API, I can make ``arr`` be a masked array, and pass it > >> into ``func``, and know that, when func sets elements of arr to NA, it > >> will only modify the mask and not the underlying data in ``arr``." > > > > Yes. > > > > > >> It does seem to me this is a very obscure case. First, ``func`` is > >> modifying the array but you want an unmodified array back. Second, > >> you'll have to do some view trick to recover the not-NA case to arr, > >> when it comes back. > > > > I know, the example is just silly and convoluted. > > > > > >> It seems to me, that what ``func`` should do, if it wants you to be > >> able to unmask the NAs, is to make a masked array view of ``arr``, and > >> return that. And indeed the simplicity of the separated API > >> immediately makes that clear - in my view at least. > > > > I agree on this example. My only concern is on the API's ability to > > foresee as most future use-cases as possible, without impacting > > performance. > > But, of course, there's a great danger in trying to cover every > possible use-case. > > My argument is that the kind of cases that you are describe are - I > believe - very rare and are even a little difficult to make up. Is > that fair? > > To my mind, the separate NA and IGNORE API is easier to understand and > explain. If that isn't true, please do say, and say why - because > that point is key. > > I think the main problem is that they aren't separate, one takes place in a view of an unmasked array, the other starts with a masked array. 
These aren't 'different' in mechanism, they are just different in work flow. And I think they fit in well with the view idea. > If it is true that the separate API is clearer, then the benefit in > terms of power and extensibility has to be large, in order to go for > the fused API. > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From kwgoodman at gmail.com Thu Jun 30 21:36:18 2011 From: kwgoodman at gmail.com (Keith Goodman) Date: Thu, 30 Jun 2011 18:36:18 -0700 Subject: [Numpy-discussion] alterNEP - was: missing data discussion round 2 In-Reply-To: References: Message-ID: On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith wrote: > On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett wrote: >> In the interest of making the discussion as concrete as possible, here >> is my draft of an alternative proposal for NAs and masking, based on >> Nathaniel's comments. ?Writing it, it seemed to me that Nathaniel is >> right, that the ideas become much clearer when the NA idea and the >> MASK idea are separate. ? Please do pitch in for things I may have >> missed or misunderstood: > [...] > > Thanks for writing this up! I stuck it up as a gist so we can edit it > more easily: > ?https://gist.github.com/1056379/ > This is your initial version: > ?https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 > And I made a few changes: > ?https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 > Specifically, I added a rationale section, changed np.MASKED to > np.IGNORE (as per comments in this thread), and added a vowel to > "propmsk". It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP. I only had a few minutes, so I only took it this far (1d arrays only): >> from nary import nary, NA, IGNORE >> arr = np.array([1,2,3,4,5,6]) >> nar = nary(arr) >> nar 1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 6.0000, >> nar[2] = NA >> nar 1.0000, 2.0000, NA, 4.0000, 5.0000, 6.0000, >> nar[4] = IGNORE >> nar 1.0000, 2.0000, NA, 4.0000, IGNORE, 6.0000, >> nar[4] IGNORE >> nar[3] 4 >> nar[2] NA The gist is here: https://gist.github.com/1057686 It probably just needs an __add__ and a reducing function such as sum, but I'm out of time, or so my family tells me. Implementation? Yes, with masks.