From chekir.amira at gmail.com Wed Jan 1 10:45:09 2014 From: chekir.amira at gmail.com (Amira Chekir) Date: Wed, 1 Jan 2014 16:45:09 +0100 Subject: [Numpy-discussion] NumPy-Discussion Digest, Vol 87, Issue 35 In-Reply-To: References: Message-ID: Hi, Thanks for your answer. I use ubuntu 12.04 32 bits and python 2.7 I upgrade numpy to 1.8, but the error persists I think that the problem is in gzip.py : max_read_chunk = 10 * 1024 * 1024 # 10Mb What do you think? Best regards, AMIRA 2013/12/31 > Send NumPy-Discussion mailing list submissions to > numpy-discussion at scipy.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mail.scipy.org/mailman/listinfo/numpy-discussion > or, via email, send a message with subject or body 'help' to > numpy-discussion-request at scipy.org > > You can reach the person managing the list at > numpy-discussion-owner at scipy.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of NumPy-Discussion digest..." > > > Today's Topics: > > 1. Loading large NIfTI file -> MemoryError (Amira Chekir) > 2. Re: Loading large NIfTI file -> MemoryError (Julian Taylor) > 3. Re: proposal: min, max of complex should give warning (Cera, > Tim) > 4. Re: proposal: min, max of complex should give warning > (Neal Becker) > 5. Re: proposal: min, max of complex should give warning > (Ralf Gommers) > 6. Re: proposal: min, max of complex should give warning > (Neal Becker) > 7. ANN: NumPy 1.7.2 release (Julian Taylor) > 8. Re: ANN: NumPy 1.7.2 release (Charles R Harris) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 31 Dec 2013 14:13:57 +0100 > From: Amira Chekir > Subject: [Numpy-discussion] Loading large NIfTI file -> MemoryError > To: numpy-discussion at scipy.org > Message-ID: > EQ29Zw at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > Hello together, > > I try to load a (large) NIfTI file (DMRI from Human Connectome Project, > about 1 GB) with NiBabel. > > import nibabel as nib > img = nib.load("dmri.nii.gz") > data = img.get_data() > > The program crashes during "img.get_data()" with an "MemoryError" (having 4 > GB of RAM in my machine). > > Any suggestions? > > Best regards, > AMIRA > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://mail.scipy.org/pipermail/numpy-discussion/attachments/20131231/b13969b3/attachment-0001.html > > ------------------------------ > > Message: 2 > Date: Tue, 31 Dec 2013 14:29:42 +0100 > From: Julian Taylor > Subject: Re: [Numpy-discussion] Loading large NIfTI file -> > MemoryError > To: Discussion of Numerical Python > Message-ID: <52C2C6C6.6070002 at googlemail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On 31.12.2013 14:13, Amira Chekir wrote: > > Hello together, > > > > I try to load a (large) NIfTI file (DMRI from Human Connectome Project, > > about 1 GB) with NiBabel. > > > > import nibabel as nib > > img = nib.load("dmri.nii.gz") > > data = img.get_data() > > > > The program crashes during "img.get_data()" with an "MemoryError" > > (having 4 GB of RAM in my machine). > > > > Any suggestions? > > are you using a 64 bit operating system? > which version of numpy? > > assuming nibabel uses np.load under the hood you could try it with numpy > 1.8 which reduces excess memory usage when loading compressed files. 
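A quick way to sanity-check the MemoryError discussed above is to confirm whether the Python build is 32- or 64-bit and to estimate how much memory the decompressed volume needs once it is scaled to float64 (as described later in this archive). The sketch below is only illustrative: the shape, the number of volumes and the on-disk dtype are made-up placeholders, not values read from the actual Human Connectome file.

import sys
import platform
import numpy as np

# A 32-bit process can address only a few GB no matter how much RAM is
# installed, and scaling the raw voxels to float64 multiplies the footprint.
print("Python build:", platform.architecture()[0])
print("64-bit pointers:", sys.maxsize > 2**32)

# Hypothetical diffusion data: 145 x 174 x 145 voxels, 288 volumes, int16 on disk.
n_items = np.prod((145, 174, 145), dtype=np.int64) * 288
raw_gb = n_items * np.dtype(np.int16).itemsize / 1e9
float_gb = n_items * np.dtype(np.float64).itemsize / 1e9
print("raw: %.2f GB, float64 copy: %.2f GB" % (raw_gb, float_gb))
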
> > > ------------------------------ > > Message: 3 > Date: Tue, 31 Dec 2013 08:51:52 -0500 > From: "Cera, Tim" > Subject: Re: [Numpy-discussion] proposal: min, max of complex should > give warning > To: Discussion of Numerical Python > Message-ID: > < > CAO5s+D_m5N6SJgsKoV7O-+yHh5gPnB0_a-ozKgETGRwTgN_axg at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > I don't work with complex numbers, but just sampling what others do: > > > Python: no ordering, results in TypeError > > Matlab: sorts by magnitude > http://www.mathworks.com/help/matlab/ref/sort.html > > R: sorts first by real, then by imaginary > http://stat.ethz.ch/R-manual/R-patched/library/base/html/sort.html > > Numpy: sorts first by real, then by imaginary (the documentation link > below calls this sort 'lexicographical' which I don't think is > correct) > http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html > > > I would think that the Matlab sort might be more useful, but easy > enough by using the absolute value. > > I think what Numpy does is normal enough to not justify a warning, but > leave this to others because as I pointed out in the beginning I don't > work with complex numbers. > > Kindest regards, > Tim > > > ------------------------------ > > Message: 4 > Date: Tue, 31 Dec 2013 10:52:47 -0500 > From: Neal Becker > Subject: Re: [Numpy-discussion] proposal: min, max of complex should > give warning > To: numpy-discussion at scipy.org > Message-ID: > Content-Type: text/plain; charset="ISO-8859-1" > > Cera, Tim wrote: > > > I don't work with complex numbers, but just sampling what others do: > > > > > > Python: no ordering, results in TypeError > > > > Matlab: sorts by magnitude > > http://www.mathworks.com/help/matlab/ref/sort.html > > > > R: sorts first by real, then by imaginary > > http://stat.ethz.ch/R-manual/R-patched/library/base/html/sort.html > > > > Numpy: sorts first by real, then by imaginary (the documentation link > > below calls this sort 'lexicographical' which I don't think is > > correct) > > http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html > > > > > > I would think that the Matlab sort might be more useful, but easy > > enough by using the absolute value. > > > > I think what Numpy does is normal enough to not justify a warning, but > > leave this to others because as I pointed out in the beginning I don't > > work with complex numbers. > > > > Kindest regards, > > Tim > > But I'm not proposing to change numpy's result, which I'm sure would raise > many > objections. I'm just asking to give a warning, because I think in most > cases > this is actually a mistake on the user's part. Just like the warning > currently > given when complex data are truncated to real part. 
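The snippet below is only an illustration of the behaviour being debated in this thread, with arbitrary sample values: np.max and np.sort compare complex numbers by real part first and imaginary part second, the magnitude-based (Matlab-style) choice needs an explicit np.abs, and casting complex data to real already raises the ComplexWarning that Neal mentions.

import warnings
import numpy as np

z = np.array([1 - 5j, 1 + 2j, -3 + 10j])

# Ordering is by real part, then imaginary part:
print(np.max(z))              # (1+2j), although -3+10j has the largest magnitude
print(np.sort(z))             # [-3.+10.j   1.-5.j   1.+2.j]

# Magnitude-based selection, as Matlab's sort would rank them:
print(z[np.abs(z).argmax()])  # (-3+10j)

# The existing warning for truncating complex values to their real part:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    z.astype(np.float64)
print([w.category.__name__ for w in caught])   # ['ComplexWarning']
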
> > > > ------------------------------ > > Message: 5 > Date: Tue, 31 Dec 2013 17:24:05 +0100 > From: Ralf Gommers > Subject: Re: [Numpy-discussion] proposal: min, max of complex should > give warning > To: Discussion of Numerical Python > Message-ID: > < > CABL7CQh9Fc0Uh36W9p16mzAR-oYjJ7_k7rU_Dwq+eZND6YrbDA at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > On Tue, Dec 31, 2013 at 4:52 PM, Neal Becker wrote: > > > Cera, Tim wrote: > > > > > I don't work with complex numbers, but just sampling what others do: > > > > > > > > > Python: no ordering, results in TypeError > > > > > > Matlab: sorts by magnitude > > > http://www.mathworks.com/help/matlab/ref/sort.html > > > > > > R: sorts first by real, then by imaginary > > > http://stat.ethz.ch/R-manual/R-patched/library/base/html/sort.html > > > > > > Numpy: sorts first by real, then by imaginary (the documentation link > > > below calls this sort 'lexicographical' which I don't think is > > > correct) > > > http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html > > > > > > > > > I would think that the Matlab sort might be more useful, but easy > > > enough by using the absolute value. > > > > > > I think what Numpy does is normal enough to not justify a warning, but > > > leave this to others because as I pointed out in the beginning I don't > > > work with complex numbers. > > > > > > Kindest regards, > > > Tim > > > > But I'm not proposing to change numpy's result, which I'm sure would > raise > > many > > objections. I'm just asking to give a warning, because I think in most > > cases > > this is actually a mistake on the user's part. Just like the warning > > currently > > given when complex data are truncated to real part. > > > > Keep in mind that warnings can be highly annoying. If you're a user who > uses this functionality regularly (and you know what you're doing), then > you're going to be very unhappy to have to wrap each function call in: > olderr = np.seterr(all='ignore') > max(...) > np.seterr(**olderr) > or in: > with warnings.catch_warnings(): > warnings.filterwarnings('ignore', ...) > max(...) > > The actual behavior isn't documented now it looks like, so that should be > done. In the Notes section of max/min probably. > > As for your proposal, it would be good to know if adding a warning would > actually catch any bugs. For the truncation warning it caught several in > scipy and other libs IIRC. > > Ralf > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: > http://mail.scipy.org/pipermail/numpy-discussion/attachments/20131231/add729d8/attachment-0001.html > > ------------------------------ > > Message: 6 > Date: Tue, 31 Dec 2013 11:45:08 -0500 > From: Neal Becker > Subject: Re: [Numpy-discussion] proposal: min, max of complex should > give warning > To: numpy-discussion at scipy.org > Message-ID: > Content-Type: text/plain; charset="ISO-8859-1" > > Ralf Gommers wrote: > > > On Tue, Dec 31, 2013 at 4:52 PM, Neal Becker > wrote: > > > >> Cera, Tim wrote: > >> > >> > I don't work with complex numbers, but just sampling what others do: > >> > > >> > > >> > Python: no ordering, results in TypeError > >> > > >> > Matlab: sorts by magnitude > >> > http://www.mathworks.com/help/matlab/ref/sort.html > >> > > >> > R: sorts first by real, then by imaginary > >> > http://stat.ethz.ch/R-manual/R-patched/library/base/html/sort.html > >> > > >> > Numpy: sorts first by real, then by imaginary (the documentation link > >> > below calls this sort 'lexicographical' which I don't think is > >> > correct) > >> > http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html > >> > > >> > > >> > I would think that the Matlab sort might be more useful, but easy > >> > enough by using the absolute value. > >> > > >> > I think what Numpy does is normal enough to not justify a warning, but > >> > leave this to others because as I pointed out in the beginning I don't > >> > work with complex numbers. > >> > > >> > Kindest regards, > >> > Tim > >> > >> But I'm not proposing to change numpy's result, which I'm sure would > raise > >> many > >> objections. I'm just asking to give a warning, because I think in most > >> cases > >> this is actually a mistake on the user's part. Just like the warning > >> currently > >> given when complex data are truncated to real part. > >> > > > > Keep in mind that warnings can be highly annoying. If you're a user who > > uses this functionality regularly (and you know what you're doing), then > > you're going to be very unhappy to have to wrap each function call in: > > olderr = np.seterr(all='ignore') > > max(...) > > np.seterr(**olderr) > > or in: > > with warnings.catch_warnings(): > > warnings.filterwarnings('ignore', ...) > > max(...) > > > > The actual behavior isn't documented now it looks like, so that should be > > done. In the Notes section of max/min probably. > > > > As for your proposal, it would be good to know if adding a warning would > > actually catch any bugs. For the truncation warning it caught several in > > scipy and other libs IIRC. > > > > Ralf > > I tripped over it yesterday, which is what prompted my suggestion. > > > > ------------------------------ > > Message: 7 > Date: Tue, 31 Dec 2013 17:57:18 +0100 > From: Julian Taylor > Subject: [Numpy-discussion] ANN: NumPy 1.7.2 release > To: Discussion of Numerical Python , > SciPy Users List , SciPy Developers > List > > Message-ID: <52C2F76E.9010509 at googlemail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > I'm happy to announce the of Numpy 1.7.2. > This is a bugfix only release supporting Python 2.4 - 2.7 and 3.1 - 3.3. > > More than 42 issues were fixed, the most important issues are listed in > the release notes: > https://github.com/numpy/numpy/blob/v1.7.2/doc/release/1.7.2-notes.rst > > Compared to the last release candidate four additional minor issues have > been fixed and compatibility with python 3.4b1 improved. 
> > Source tarballs, installers and release notes can be found at > https://sourceforge.net/projects/numpy/files/NumPy/1.7.2 > > Cheers, > Julian Taylor > > > ------------------------------ > > Message: 8 > Date: Tue, 31 Dec 2013 10:47:44 -0700 > From: Charles R Harris > Subject: Re: [Numpy-discussion] ANN: NumPy 1.7.2 release > To: Discussion of Numerical Python > Message-ID: > abrqm4DNRG7f6-1keU_hPd253O64d0-Yhw at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > On Tue, Dec 31, 2013 at 9:57 AM, Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > > > Hello, > > > > I'm happy to announce the of Numpy 1.7.2. > > This is a bugfix only release supporting Python 2.4 - 2.7 and 3.1 - 3.3. > > > > More than 42 issues were fixed, the most important issues are listed in > > the release notes: > > https://github.com/numpy/numpy/blob/v1.7.2/doc/release/1.7.2-notes.rst > > > > Compared to the last release candidate four additional minor issues have > > been fixed and compatibility with python 3.4b1 improved. > > > > Source tarballs, installers and release notes can be found at > > https://sourceforge.net/projects/numpy/files/NumPy/1.7.2 > > > > > Congrats on the release. > > Chuck > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://mail.scipy.org/pipermail/numpy-discussion/attachments/20131231/946abcb9/attachment.html > > ------------------------------ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > End of NumPy-Discussion Digest, Vol 87, Issue 35 > ************************************************ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekir.amira at gmail.com Wed Jan 1 10:50:11 2014 From: chekir.amira at gmail.com (Amira Chekir) Date: Wed, 1 Jan 2014 16:50:11 +0100 Subject: [Numpy-discussion] NumPy-Discussion Digest, Vol 88, Issue 1 In-Reply-To: References: Message-ID: On 31.12.2013 14:13, Amira Chekir wrote: > > Hello together, > > > > I try to load a (large) NIfTI file (DMRI from Human Connectome Project, > > about 1 GB) with NiBabel. > > > > import nibabel as nib > > img = nib.load("dmri.nii.gz") > > data = img.get_data() > > > > The program crashes during "img.get_data()" with an "MemoryError" > > (having 4 GB of RAM in my machine). > > > > Any suggestions? > > are you using a 64 bit operating system? > which version of numpy? > > assuming nibabel uses np.load under the hood you could try it with numpy > 1.8 which reduces excess memory usage when loading compressed files. Hi, Thanks for your answer. I use ubuntu 12.04 32 bits and python 2.7 I upgrade numpy to 1.8, but the error persists I think that the problem is in gzip.py : max_read_chunk = 10 * 1024 * 1024 # 10Mb What do you think? Best regards, AMIRA 2014/1/1 > Send NumPy-Discussion mailing list submissions to > numpy-discussion at scipy.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://mail.scipy.org/mailman/listinfo/numpy-discussion > or, via email, send a message with subject or body 'help' to > numpy-discussion-request at scipy.org > > You can reach the person managing the list at > numpy-discussion-owner at scipy.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of NumPy-Discussion digest..." > > > Today's Topics: > > 1. Re: proposal: min, max of complex should give warning (Ralf > Gommers) (David Goldsmith) > 2. 
Re: NumPy-Discussion Digest, Vol 87, Issue 35 (Amira Chekir) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 31 Dec 2013 11:43:49 -0800 > From: David Goldsmith > Subject: Re: [Numpy-discussion] proposal: min, max of complex should > give warning (Ralf Gommers) > To: numpy-discussion at scipy.org > Message-ID: > rWU6EVuBMG+mY-XJdA at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > > > > As for your proposal, it would be good to know if adding a warning would > > actually catch any bugs. For the truncation warning it caught several in > > scipy and other libs IIRC. > > > > Ralf > > > > In light of this, perhaps the pertinent unit tests should be modified (even > if the warning suggestion isn't adopted, about which I'm neutral...but I'm > a little surprised that there isn't a generic way to globally turn off > specific warnings). > > DG > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://mail.scipy.org/pipermail/numpy-discussion/attachments/20131231/ac17f43e/attachment-0001.html > > ------------------------------ > > Message: 2 > Date: Wed, 1 Jan 2014 16:45:09 +0100 > From: Amira Chekir > Subject: Re: [Numpy-discussion] NumPy-Discussion Digest, Vol 87, Issue > 35 > To: numpy-discussion at scipy.org > Message-ID: > < > CAB-foYhZMYH+asXUC_SnO6bjCDSOji+d8J6tyqSvucNOv_dyiQ at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > Hi, > Thanks for your answer. > I use ubuntu 12.04 32 bits and python 2.7 > I upgrade numpy to 1.8, but the error persists > I think that the problem is in gzip.py : > max_read_chunk = 10 * 1024 * 1024 # 10Mb > What do you think? > > Best regards, > AMIRA > > > 2013/12/31 > > > Send NumPy-Discussion mailing list submissions to > > numpy-discussion at scipy.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > or, via email, send a message with subject or body 'help' to > > numpy-discussion-request at scipy.org > > > > You can reach the person managing the list at > > numpy-discussion-owner at scipy.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of NumPy-Discussion digest..." > > > > > > Today's Topics: > > > > 1. Loading large NIfTI file -> MemoryError (Amira Chekir) > > 2. Re: Loading large NIfTI file -> MemoryError (Julian Taylor) > > 3. Re: proposal: min, max of complex should give warning (Cera, > > Tim) > > 4. Re: proposal: min, max of complex should give warning > > (Neal Becker) > > 5. Re: proposal: min, max of complex should give warning > > (Ralf Gommers) > > 6. Re: proposal: min, max of complex should give warning > > (Neal Becker) > > 7. ANN: NumPy 1.7.2 release (Julian Taylor) > > 8. Re: ANN: NumPy 1.7.2 release (Charles R Harris) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Tue, 31 Dec 2013 14:13:57 +0100 > > From: Amira Chekir > > Subject: [Numpy-discussion] Loading large NIfTI file -> MemoryError > > To: numpy-discussion at scipy.org > > Message-ID: > > > EQ29Zw at mail.gmail.com> > > Content-Type: text/plain; charset="iso-8859-1" > > > > Hello together, > > > > I try to load a (large) NIfTI file (DMRI from Human Connectome Project, > > about 1 GB) with NiBabel. 
> > [...]
> >
> > End of NumPy-Discussion Digest, Vol 87, Issue 35
> > ************************************************
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20140101/279def51/attachment.html
>
> ------------------------------
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> End of NumPy-Discussion Digest, Vol 88, Issue 1
> ***********************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From jtaylor.debian at googlemail.com Wed Jan 1 10:56:14 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 01 Jan 2014 16:56:14 +0100 Subject: [Numpy-discussion] NumPy-Discussion Digest, Vol 88, Issue 1 In-Reply-To: References: Message-ID: <52C43A9E.7060403@googlemail.com> On 01.01.2014 16:50, Amira Chekir wrote: > On 31.12.2013 14:13, Amira Chekir wrote: >> > Hello together, >> > >> > I try to load a (large) NIfTI file (DMRI from Human Connectome Project, >> > about 1 GB) with NiBabel. >> > >> > import nibabel as nib >> > img = nib.load("dmri.nii.gz") >> > data = img.get_data() >> > >> > The program crashes during "img.get_data()" with an "MemoryError" >> > (having 4 GB of RAM in my machine). >> > >> > Any suggestions? >> >> are you using a 64 bit operating system? >> which version of numpy? >> >> assuming nibabel uses np.load under the hood you could try it with numpy >> 1.8 which reduces excess memory usage when loading compressed files. > > Hi, > Thanks for your answer. > I use ubuntu 12.04 32 bits and python 2.7 > I upgrade numpy to 1.8, but the error persists > I think that the problem is in gzip.py : > max_read_chunk = 10 * 1024 * 1024 # 10Mb > What do you think? > On a 32 bit system you can only use 2GB of ram (even if you have 4GB). A single copy of your data will already exhaust this and this can be hard to avoid with numpy. Use an 64 bit operating system with more RAM or somehow try to chunk your workload into smaller sizes. From bartbkr at gmail.com Wed Jan 1 14:56:44 2014 From: bartbkr at gmail.com (Bart Baker) Date: Wed, 1 Jan 2014 14:56:44 -0500 Subject: [Numpy-discussion] Altering/initializing NumPy array in C Message-ID: Hello, I'm having issues with performing operations on an array in C and passing it back to Python. The array values seem to become unitialized upon being passed back to Python. My first attempt involved initializing the array in C as so: double a_fin[max_mth]; where max_mth is an int. I fill in the values of a_fin and then, before returning back to Python, I create a NumPy array in C and fill it in using a pointer to the a_fin array: npy_intp a_dims[2] = {max_mth, 1}; a_fin_array = (PyArrayObject *) PyArray_SimpleNewFromData(2, a_dims, NPY_DOUBLE, a_fin); I update the flags as so: PyArray_UpdateFlags(a_fin_array, NPY_OWNDATA); and return using: PyObject *Result = Py_BuildValue("OO", a_fin_array, b_fin_array); Py_DECREF(a_fin_array); (there is another array, b_bin_array that I create in this way and it suffers from the same issues). Immediately upon returning to Python, all of a_fin_array appears unitilized. This only happens in certain situations and sometime only part of the arrary will be unitilized. I check the values of a_dim and a_fin_array in C using gdb and they appear as expected, but are over-written with unitialized values upon returning to Python. I've tried initializing in Python and then passing the NumPy array in instead of initializing in C, but the effects of the calculations in C are still not kept. My feeling is that, with the Numpy C-API, this should be a simple process, but I'm having a lot of trouble with it. Any help would be much appreciated. I didn't want to give too much information in the post, but please let me know what other information would be useful. 
-Bart From njs at pobox.com Wed Jan 1 15:04:52 2014 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 1 Jan 2014 14:04:52 -0600 Subject: [Numpy-discussion] Altering/initializing NumPy array in C In-Reply-To: References: Message-ID: On 1 Jan 2014 13:57, "Bart Baker" wrote: > > Hello, > > I'm having issues with performing operations on an array in C and > passing it back to Python. The array values seem to become unitialized > upon being passed back to Python. My first attempt involved initializing > the array in C as so: > > double a_fin[max_mth]; > > where max_mth is an int. You're stack-allocating your array, so the memory is getting recycled for other uses as soon as your C function returns. You should malloc it instead (but you don't have to worry about free'ing it, numpy will do that when the array object is deconstructed). Any C reference will fill you in on the details of stack versus malloc allocation. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From bartbkr at gmail.com Thu Jan 2 08:49:11 2014 From: bartbkr at gmail.com (Bart Baker) Date: Thu, 2 Jan 2014 08:49:11 -0500 Subject: [Numpy-discussion] Altering/initializing NumPy array in C In-Reply-To: References: Message-ID: > You're stack-allocating your array, so the memory is getting recycled for > other uses as soon as your C function returns. You should malloc it instead > (but you don't have to worry about free'ing it, numpy will do that when the > array object is deconstructed). Any C reference will fill you in on the > details of stack versus malloc allocation. OK, that makes a lot of sense. I changed them to malloc's of the appropriate size and now things seems to be working well. It also led to a good read on stack vs heap. Thanks a lot, Bart From d.l.goldsmith at gmail.com Thu Jan 2 15:29:42 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Thu, 2 Jan 2014 12:29:42 -0800 Subject: [Numpy-discussion] Quaternion type @ rosettacode.org Message-ID: Anyone here use/have an opinion about the Quaternion type @ rosettacode.org? Or have an opinion about it having derived the type from collections.namedtuple? Anyone have an open-source, numpy-based alternative? Ditto last question for Octonion and/or general n-basis Grassmann (exterior) and/or Clifford Algebras? (rosettacode appears to have none of these). Thanks! David Goldsmith -------------- next part -------------- An HTML attachment was scrubbed... URL: From scopatz at gmail.com Thu Jan 2 15:44:23 2014 From: scopatz at gmail.com (Anthony Scopatz) Date: Thu, 2 Jan 2014 12:44:23 -0800 Subject: [Numpy-discussion] Quaternion type @ rosettacode.org In-Reply-To: References: Message-ID: Hello David, There is a numpy-quarterion repo that has served me well in the past. I believe this came out of a SciPy 2011 sprint. See https://github.com/martinling/numpy_quaternion. I hope this helps. Be Well Anthony On Thu, Jan 2, 2014 at 12:29 PM, David Goldsmith wrote: > Anyone here use/have an opinion about the Quaternion type @ > rosettacode.org? > Or have an opinion about it having derived the type from > collections.namedtuple? Anyone have an open-source, numpy-based > alternative? Ditto last question for Octonion and/or general n-basis > Grassmann (exterior) and/or Clifford Algebras? (rosettacode appears to > have none of these). Thanks! 
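For readers who only need the arithmetic, a plain ndarray plus a Hamilton product already goes a long way; the sketch below is a minimal illustration of such a numpy-based approach and is not taken from rosettacode or from any of the libraries mentioned elsewhere in the thread.

import numpy as np

def quat_mul(p, q):
    # Hamilton product of quaternions stored as [w, x, y, z] arrays.
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

i = np.array([0., 1., 0., 0.])
j = np.array([0., 0., 1., 0.])
print(quat_mul(i, j))   # [0. 0. 0. 1.]   -> k
print(quat_mul(j, i))   # [ 0.  0.  0. -1.] -> -k, multiplication is non-commutative
print(quat_mul(i, i))   # [-1.  0.  0.  0.] -> -1
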
> > David Goldsmith > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.leopardi at anu.edu.au Thu Jan 2 16:16:53 2014 From: paul.leopardi at anu.edu.au (Paul Leopardi) Date: Fri, 3 Jan 2014 08:16:53 +1100 Subject: [Numpy-discussion] Quaternion type @ rosettacode.org In-Reply-To: References: Message-ID: <5706837.Fip8dVJlit@linfinit> On Thu, 2 Jan 2014 12:29:42 David Goldsmith wrote: > Anyone here use/have an opinion about the Quaternion type @ > rosettacode.org tions#Python>? Or have an opinion about it having derived the type from > collections.namedtuple? Anyone have an open-source, numpy-based > alternative? Ditto last question for Octonion and/or general n-basis > Grassmann (exterior) and/or Clifford Algebras? (rosettacode appears to > have none of these). Thanks! Hi David, Not Numpy based, but: GluCat http://sourceforge.net/projects/glucat/ is an open source C++ library for calculations in Clifford algebras, based on the C++ Standard Library and Boost uBLAS. It also includes PyClical, a Python extension module coded in Cython. The PyClical tutorials and demos at http://sourceforge.net/p/glucat/git/ci/master/tree/pyclical/demos/ should give you an idea of how PyClical can be used with Numpy, SciPy and the rest of Python. See also http://sourceforge.net/p/glucat/git/ci/master/tree/README If you have compilation problems, try the release_0_7_1-patches branch: http://sourceforge.net/p/glucat/git/ci/release_0_7_1-patches/ I am always open to feedback and criticism of this code. All the best, Paul -- Paul Leopardi http://www.maths.anu.edu.au/~leopardi -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Fri Jan 3 05:39:25 2014 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 3 Jan 2014 10:39:25 +0000 Subject: [Numpy-discussion] Loading large NIfTI file -> MemoryError In-Reply-To: <52C2C6C6.6070002@googlemail.com> References: <52C2C6C6.6070002@googlemail.com> Message-ID: Hi, On Tue, Dec 31, 2013 at 1:29 PM, Julian Taylor wrote: > > On 31.12.2013 14:13, Amira Chekir wrote: > > Hello together, > > > > I try to load a (large) NIfTI file (DMRI from Human Connectome Project, > > about 1 GB) with NiBabel. > > > > import nibabel as nib > > img = nib.load("dmri.nii.gz") > > data = img.get_data() > > > > The program crashes during "img.get_data()" with an "MemoryError" > > (having 4 GB of RAM in my machine). > > > > Any suggestions? > > are you using a 64 bit operating system? > which version of numpy? I think you want the nipy-devel mailing list for this question : http://nipy.org/nibabel/ I'm guessing that the reader is loading the raw data which is - say - int16 - and then multiplying by the scale factors to make a float64 image, which is 4 times larger. 
We're working on an iterative load API at the moment that might help loading the image slice by slice : https://github.com/nipy/nibabel/pull/211 It should be merged in a week or so - but it would be very helpful if you would try out the proposal to see if it helps, Best, Matthew From freddie at witherden.org Fri Jan 3 07:58:49 2014 From: freddie at witherden.org (Freddie Witherden) Date: Fri, 03 Jan 2014 12:58:49 +0000 Subject: [Numpy-discussion] Padding An Array Along A Single Axis Message-ID: <52C6B409.4090005@witherden.org> Hi all, This should be an easy one but I can not come up with a good solution. Given an ndarray with a shape of (..., X) I wish to zero-pad it to have a shape of (..., X + K), presumably obtaining a new array in the process. My best solution this far is to use np.zeros(curr.shape[:-1] + (curr.shape[-1] + K,)) followed by an assignment. However, this seems needlessly cumbersome. I looked at np.pad but it does not seem to provide a means of just padding a single axis easily. Regards, Freddie. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From joferkington at gmail.com Fri Jan 3 09:02:03 2014 From: joferkington at gmail.com (Joe Kington) Date: Fri, 3 Jan 2014 08:02:03 -0600 Subject: [Numpy-discussion] Padding An Array Along A Single Axis In-Reply-To: <52C6B409.4090005@witherden.org> References: <52C6B409.4090005@witherden.org> Message-ID: You can use np.pad for this: In [1]: import numpy as np In [2]: x = np.ones((3, 3)) In [3]: np.pad(x, [(0, 0), (0, 1)], mode='constant') Out[3]: array([[ 1., 1., 1., 0.], [ 1., 1., 1., 0.], [ 1., 1., 1., 0.]]) Each item of the pad_width (second) argument is a tuple of before, after for each axis. I've only padded the end of the last axis, but if you wanted to pad both "sides" of it: In [4]: np.pad(x, [(0, 0), (1, 1)], mode='constant') Out[4]: array([[ 0., 1., 1., 1., 0.], [ 0., 1., 1., 1., 0.], [ 0., 1., 1., 1., 0.]]) Hope that helps, -Joe On Fri, Jan 3, 2014 at 6:58 AM, Freddie Witherden wrote: > Hi all, > > This should be an easy one but I can not come up with a good solution. > Given an ndarray with a shape of (..., X) I wish to zero-pad it to have > a shape of (..., X + K), presumably obtaining a new array in the process. > > My best solution this far is to use > > np.zeros(curr.shape[:-1] + (curr.shape[-1] + K,)) > > followed by an assignment. However, this seems needlessly cumbersome. > I looked at np.pad but it does not seem to provide a means of just > padding a single axis easily. > > Regards, Freddie. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Sat Jan 4 01:00:33 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Fri, 3 Jan 2014 22:00:33 -0800 Subject: [Numpy-discussion] Quaternion type @ rosettacode.org Message-ID: Thanks Anthony and Paul! OlyDLG -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Nicolas.Rougier at inria.fr Sat Jan 4 03:50:04 2014 From: Nicolas.Rougier at inria.fr (Nicolas Rougier) Date: Sat, 4 Jan 2014 09:50:04 +0100 Subject: [Numpy-discussion] ArrayList object Message-ID: <6F4DCFB4-813D-4595-BB02-2AC2CA181A57@inria.fr> Hi all, I've coding an ArrayList object based on a regular numpy array. This objects allows to dynamically append/insert/delete/access items. I found it quite convenient since it allows to manipulate an array as if it was a list with elements of different sizes but with same underlying type (=array dtype). # Creation from a nested list L = ArrayList([ [0], [1,2], [3,4,5], [6,7,8,9] ]) # Creation from an array + common item size L = ArrayList(np.ones(1000), 3) # Empty list L = ArrayList(dype=int) # Creation from an array + individual item sizes L = ArrayList(np.ones(10), 1+np.arange(4)) # Access to elements: print L[0], L[1], L[2], L[3] [0] [1 2] [3 4 5] [6 7 8 9] # Operations on elements L[:2] += 1 print L.data [1 2 3 3 4 5 6 7 8 9] Source code is available from: https://github.com/rougier/array-list I wonder is there is any interest in having such object within core numpy (np.list ?) ? Nicolas From ralf.gommers at gmail.com Sat Jan 4 10:45:14 2014 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sat, 4 Jan 2014 16:45:14 +0100 Subject: [Numpy-discussion] proposal: min, max of complex should give warning In-Reply-To: References: Message-ID: On Tue, Dec 31, 2013 at 5:45 PM, Neal Becker wrote: > Ralf Gommers wrote: > > > On Tue, Dec 31, 2013 at 4:52 PM, Neal Becker > wrote: > > > >> Cera, Tim wrote: > >> > >> > I don't work with complex numbers, but just sampling what others do: > >> > > >> > > >> > Python: no ordering, results in TypeError > >> > > >> > Matlab: sorts by magnitude > >> > http://www.mathworks.com/help/matlab/ref/sort.html > >> > > >> > R: sorts first by real, then by imaginary > >> > http://stat.ethz.ch/R-manual/R-patched/library/base/html/sort.html > >> > > >> > Numpy: sorts first by real, then by imaginary (the documentation link > >> > below calls this sort 'lexicographical' which I don't think is > >> > correct) > >> > http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html > >> > > >> > > >> > I would think that the Matlab sort might be more useful, but easy > >> > enough by using the absolute value. > >> > > >> > I think what Numpy does is normal enough to not justify a warning, but > >> > leave this to others because as I pointed out in the beginning I don't > >> > work with complex numbers. > >> > > >> > Kindest regards, > >> > Tim > >> > >> But I'm not proposing to change numpy's result, which I'm sure would > raise > >> many > >> objections. I'm just asking to give a warning, because I think in most > >> cases > >> this is actually a mistake on the user's part. Just like the warning > >> currently > >> given when complex data are truncated to real part. > >> > > > > Keep in mind that warnings can be highly annoying. If you're a user who > > uses this functionality regularly (and you know what you're doing), then > > you're going to be very unhappy to have to wrap each function call in: > > olderr = np.seterr(all='ignore') > > max(...) > > np.seterr(**olderr) > > or in: > > with warnings.catch_warnings(): > > warnings.filterwarnings('ignore', ...) > > max(...) > > > > The actual behavior isn't documented now it looks like, so that should be > > done. In the Notes section of max/min probably. > > > > As for your proposal, it would be good to know if adding a warning would > > actually catch any bugs. 
For the truncation warning it caught several in > > scipy and other libs IIRC. > > > > Ralf > > I tripped over it yesterday, which is what prompted my suggestion. > That I had guessed. I meant: can you try to add this warning and then see if it catches any bugs or displays any incorrect warnings for scipy and some scikits? Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Sat Jan 4 14:14:37 2014 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sat, 4 Jan 2014 20:14:37 +0100 Subject: [Numpy-discussion] C99 compatible complex number tests fail In-Reply-To: <52B7723B.7080900@gmail.com> References: <52B7723B.7080900@gmail.com> Message-ID: On Mon, Dec 23, 2013 at 12:14 AM, Matti Picus wrote: > Hi. I started to port the stdlib cmath C99 compatible complex number > tests to numpy, after noticing that numpy seems to have different > complex number routines than cmath. The work is available on a > "retest_complex" branch of numpy > https://github.com/mattip/numpy/tree/retest_complex > The tests can be run by pulling the branch (no need to rebuild numpy) > and running > > python /numpy/core/tests/test_umath_complex.py > > test.log 2>&1 > > So far it is just a couple of commits that run the tests on numpy, I > did not dive into modifying the math routines. If I did the work > correctly, failures point to some differences, most due to edge cases > with inf and nan, but there are a number of failures due to different > finite values (for some small definition of different). > I guess my first question is "did I do the tests properly". > They work fine, however you did it in a nonstandard way which makes the output hard to read. Some comments: - the assert_* functions expect "actual" as first input and "desired" next, while you have them reversed. - it would be good to split those tests into multiple cases, for example one per function to be tested. - you shouldn't print anything, just let it fail. If you want to see each individual failure, use generator tests. - the cmathtestcases.txt is a little nonstandard but should be OK to keep it like that. Assuming I did, the next question is "are the inconsistencies > intentional" i.e. are they that way in order to be compatible with > Matlab or some other non-C99 conformant library? > The implementation should conform to IEEE 754. > > For instance, a comparison between the implementation of cmath's sqrt > and numpy's sqrt shows that numpy does not check for subnormals. I suspect no handling for denormals was done on purpose, since that should have a significant performance penalty. I'm not sure about other differences, probably just following a different reference. And I am probably mistaken since I am new to the generator methods of numpy, > but could it be that trigonometric functions like acos and acosh are > generated in umath/funcs.inc.src, using a very different algorithm than > cmathmodule.c? > You're not mistaken. > Would there be interest in a pull request that changed the routines to > be more compatible with results from cmath? > I don't think compatibility with cmath should be a goal, but if you find differences where cmath has a more accurate or faster implementation, then a PR to adopt the cmath algorithm would be very welcome. Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
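A lightweight way to explore the differences discussed here, without asserting which implementation is right, is simply to print numpy and cmath side by side on a few edge-case inputs; the values below are an arbitrary selection including infinities, NaN and a subnormal real part.

import cmath
import numpy as np

cases = [complex(-1.0, 0.0),
         complex(-1.0, -0.0),
         complex(float('inf'), 1.0),
         complex(float('nan'), 0.0),
         complex(1e-320, 0.0)]   # subnormal real part

for z in cases:
    print("%r: cmath=%r numpy=%r" % (z, cmath.sqrt(z), complex(np.sqrt(np.complex128(z)))))
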
URL: From ewm at redtetrahedron.org Sat Jan 4 20:39:03 2014 From: ewm at redtetrahedron.org (Eric Moore) Date: Sat, 4 Jan 2014 20:39:03 -0500 Subject: [Numpy-discussion] C99 compatible complex number tests fail In-Reply-To: References: <52B7723B.7080900@gmail.com> Message-ID: On Saturday, January 4, 2014, Ralf Gommers wrote: > > > > On Mon, Dec 23, 2013 at 12:14 AM, Matti Picus > > wrote: > >> Hi. I started to port the stdlib cmath C99 compatible complex number >> tests to numpy, after noticing that numpy seems to have different >> complex number routines than cmath. The work is available on a >> "retest_complex" branch of numpy >> https://github.com/mattip/numpy/tree/retest_complex >> The tests can be run by pulling the branch (no need to rebuild numpy) >> and running >> >> python /numpy/core/tests/test_umath_complex.py > >> test.log 2>&1 >> >> So far it is just a couple of commits that run the tests on numpy, I >> did not dive into modifying the math routines. If I did the work >> correctly, failures point to some differences, most due to edge cases >> with inf and nan, but there are a number of failures due to different >> finite values (for some small definition of different). >> I guess my first question is "did I do the tests properly". >> > > They work fine, however you did it in a nonstandard way which makes the > output hard to read. Some comments: > - the assert_* functions expect "actual" as first input and "desired" > next, while you have them reversed. > - it would be good to split those tests into multiple cases, for example > one per function to be tested. > - you shouldn't print anything, just let it fail. If you want to see each > individual failure, use generator tests. > - the cmathtestcases.txt is a little nonstandard but should be OK to keep > it like that. > > Assuming I did, the next question is "are the inconsistencies >> intentional" i.e. are they that way in order to be compatible with >> Matlab or some other non-C99 conformant library? >> > > The implementation should conform to IEEE 754. > >> >> For instance, a comparison between the implementation of cmath's sqrt >> and numpy's sqrt shows that numpy does not check for subnormals. > > > I suspect no handling for denormals was done on purpose, since that should > have a significant performance penalty. I'm not sure about other > differences, probably just following a different reference. > > And I am probably mistaken since I am new to the generator methods of >> numpy, >> but could it be that trigonometric functions like acos and acosh are >> generated in umath/funcs.inc.src, using a very different algorithm than >> cmathmodule.c? >> > > You're not mistaken. > > >> Would there be interest in a pull request that changed the routines to >> be more compatible with results from cmath? >> > > I don't think compatibility with cmath should be a goal, but if you find > differences where cmath has a more accurate or faster implementation, then > a PR to adopt the cmath algorithm would be very welcome. > > Ralf > Have you seen https://github.com/numpy/numpy/pull/3010 ? This adds C99 compatible complex functions and tests with build time checking if the system provided functions can pass our tests. I should have some time to get back to it soon, but somemore eyes and tests and input would be good. Especially since it's not clear to me if all of the changes will be accepted. Eric -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Tue Jan 7 13:59:59 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 7 Jan 2014 11:59:59 -0700 Subject: [Numpy-discussion] LLVM Message-ID: Has anyone tried using LLVM with Visual Studio? It is supposed to work with Visual Studio >= 2010 and might provide an alternative to MinGw64. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From cournape at gmail.com Tue Jan 7 16:49:45 2014 From: cournape at gmail.com (David Cournapeau) Date: Tue, 7 Jan 2014 21:49:45 +0000 Subject: [Numpy-discussion] LLVM In-Reply-To: References: Message-ID: On Tue, Jan 7, 2014 at 6:59 PM, Charles R Harris wrote: > Has anyone tried using LLVM with Visual Studio? It is supposed to work > with Visual Studio >= 2010 and might provide an alternative to MinGw64. > Yes, I have. It is still pretty painful to use on windows beyond simple examples, though I have not tried the new 3.4 version. See also that discussion I had with one clang dev @ apple a couple of months ago: https://twitter.com/cournape/status/381038514076655618 David > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Wed Jan 8 13:13:20 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 08 Jan 2014 19:13:20 +0100 Subject: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array In-Reply-To: References: Message-ID: <52CD9540.3000802@googlemail.com> On 18.07.2013 15:36, Nathaniel Smith wrote: > On Wed, Jul 17, 2013 at 5:57 PM, Fr?d?ric Bastien wrote: >> On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith wrote: >>>> >>>> On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith wrote: >>> It's entirely possible I misunderstood, so let's see if we can work it >>> out. I know that you want to assign to the ->data pointer in a >>> PyArrayObject, right? That's what caused some trouble with the 1.7 API >>> deprecations, which were trying to prevent direct access to this >>> field? Creating a new array given a pointer to a memory region is no >>> problem, and obviously will be supported regardless of any >>> optimizations. But if that's all you were doing then you shouldn't >>> have run into the deprecation problem. Or maybe I'm misremembering! >> >> What is currently done at only 1 place is to create a new PyArrayObject with >> a given ptr. So NumPy don't do the allocation. We later change that ptr to >> another one. > > Hmm, OK, so that would still work. If the array has the OWNDATA flag > set (or you otherwise know where the data came from), then swapping > the data pointer would still work. > > The change would be that in most cases when asking numpy to allocate a > new array from scratch, the OWNDATA flag would not be set. That's > because the OWNDATA flag really means "when this object is > deallocated, call free(self->data)", but if we allocate the array > struct and the data buffer together in a single memory region, then > deallocating the object will automatically cause the data buffer to be > deallocated as well, without the array destructor having to take any > special effort. > >> It is the change to the ptr of the just created PyArrayObject that caused >> problem with the interface deprecation. 
I fixed all other problem releated >> to the deprecation (mostly just rename of function/macro). But I didn't >> fixed this one yet. I would need to change the logic to compute the final >> ptr before creating the PyArrayObject object and create it with the final >> data ptr. But in call cases, NumPy didn't allocated data memory for this >> object, so this case don't block your optimization. > > Right. > >> One thing in our optimization "wish list" is to reuse allocated >> PyArrayObject between Theano function call for intermediate results(so >> completly under Theano control). This could be useful in particular for >> reshape/transpose/subtensor. Those functions are pretty fast and from >> memory, I already found the allocation time was significant. But in those >> cases, it is on PyArrayObject that are views, so the metadata and the data >> would be in different memory region in all cases. >> >> The other cases of optimization "wish list" is if we want to reuse the >> PyArrayObject when the shape isn't the good one (but the number of >> dimensions is the same). If we do that for operation like addition, we will >> need to use PyArray_Resize(). This will be done on PyArrayObject whose data >> memory was allocated by NumPy. So if you do one memory allowcation for >> metadata and data, just make sure that PyArray_Resize() will handle that >> correctly. > > I'm not sure I follow the details here, but it does turn out that a > really surprising amount of time in PyArray_NewFromDescr is spent in > just calculating and writing out the shape and strides buffers, so for > programs that e.g. use hundreds of small 3-element arrays to represent > points in space, re-using even these buffers might be a big win... > >> On the usefulness of doing only 1 memory allocation, on our old gpu ndarray, >> we where doing 2 alloc on the GPU, one for metadata and one for data. I >> removed this, as this was a bottleneck. allocation on the CPU are faster the >> on the GPU, but this is still something that is slow except if you reuse >> memory. Do PyMem_Malloc, reuse previous small allocation? > > Yes, at least in theory PyMem_Malloc is highly-optimized for small > buffer re-use. (For requests >256 bytes it just calls malloc().) And > it's possible to define type-specific freelists; not sure if there's > any value in doing that for PyArrayObjects. See Objects/obmalloc.c in > the Python source tree. > > -n PyMem_Malloc is just a wrapper around malloc, so its only as optimized as the c library is (glibc is not good for small allocations). PyObject_Malloc uses a small object allocator for requests smaller 512 bytes (256 in python2). I filed a pull request [0] replacing a few functions which I think are safe to convert to this API. The nditer allocation which is completely encapsulated and the construction of the scalar and array python objects which are deleted via the tp_free slot (we really should not support third party libraries using PyMem_Free on python objects without checks). This already gives up to 15% improvements for scalar operations compared to glibc 2.17 malloc. Do I understand the discussions here right that we could replace PyDimMem_NEW which is used for strides in PyArray with the small object allocation too? It would still allow swapping the stride buffer, but every application must then delete it with PyDimMem_FREE which should be a reasonable requirement. 
[0] https://github.com/numpy/numpy/pull/4177 From nouiz at nouiz.org Wed Jan 8 14:04:38 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Wed, 8 Jan 2014 14:04:38 -0500 Subject: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array In-Reply-To: <52CD9540.3000802@googlemail.com> References: <52CD9540.3000802@googlemail.com> Message-ID: Hi, As told, I don't think Theano swap the stride buffer. Most of the time, we allocated with PyArray_empty or zeros. (not sure of the capitals). The only exception I remember have been changed in the last release to use PyArray_NewFromDescr(). Before that, we where allocating the PyArray with the right number of dimensions, then we where manually filling the ptr, shapes and strides. I don't recall any swapping of pointer for shapes and strides in Theano. So I don't see why Theano would prevent doing just one malloc for the struct and the shapes/strides. If it does, tell me and I'll fix Theano:) I don't want Theano to prevent optimization in NumPy. Theano now support completly the new NumPy C-API interface. Nathaniel also told that resizing the PyArray could prevent that. When Theano call PyArray_resize (not sure of the syntax), we always keep the number of dimensions the same. But I don't know if other code do differently. That could be a reason to keep separate alloc. I don't know any software that manually free the strides/shapes pointer to swap it. So I also think your suggestion to change PyDimMem_NEW to call the small allocator is good. The new interface prevent people from doing that anyway I think. Do we need to wait until we completly remove the old interface for this? Fred On Wed, Jan 8, 2014 at 1:13 PM, Julian Taylor wrote: > On 18.07.2013 15:36, Nathaniel Smith wrote: >> On Wed, Jul 17, 2013 at 5:57 PM, Fr?d?ric Bastien wrote: >>> On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith wrote: >>>>> >>>>> On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith wrote: >>>> It's entirely possible I misunderstood, so let's see if we can work it >>>> out. I know that you want to assign to the ->data pointer in a >>>> PyArrayObject, right? That's what caused some trouble with the 1.7 API >>>> deprecations, which were trying to prevent direct access to this >>>> field? Creating a new array given a pointer to a memory region is no >>>> problem, and obviously will be supported regardless of any >>>> optimizations. But if that's all you were doing then you shouldn't >>>> have run into the deprecation problem. Or maybe I'm misremembering! >>> >>> What is currently done at only 1 place is to create a new PyArrayObject with >>> a given ptr. So NumPy don't do the allocation. We later change that ptr to >>> another one. >> >> Hmm, OK, so that would still work. If the array has the OWNDATA flag >> set (or you otherwise know where the data came from), then swapping >> the data pointer would still work. >> >> The change would be that in most cases when asking numpy to allocate a >> new array from scratch, the OWNDATA flag would not be set. That's >> because the OWNDATA flag really means "when this object is >> deallocated, call free(self->data)", but if we allocate the array >> struct and the data buffer together in a single memory region, then >> deallocating the object will automatically cause the data buffer to be >> deallocated as well, without the array destructor having to take any >> special effort. >> >>> It is the change to the ptr of the just created PyArrayObject that caused >>> problem with the interface deprecation. 
I fixed all other problem releated >>> to the deprecation (mostly just rename of function/macro). But I didn't >>> fixed this one yet. I would need to change the logic to compute the final >>> ptr before creating the PyArrayObject object and create it with the final >>> data ptr. But in call cases, NumPy didn't allocated data memory for this >>> object, so this case don't block your optimization. >> >> Right. >> >>> One thing in our optimization "wish list" is to reuse allocated >>> PyArrayObject between Theano function call for intermediate results(so >>> completly under Theano control). This could be useful in particular for >>> reshape/transpose/subtensor. Those functions are pretty fast and from >>> memory, I already found the allocation time was significant. But in those >>> cases, it is on PyArrayObject that are views, so the metadata and the data >>> would be in different memory region in all cases. >>> >>> The other cases of optimization "wish list" is if we want to reuse the >>> PyArrayObject when the shape isn't the good one (but the number of >>> dimensions is the same). If we do that for operation like addition, we will >>> need to use PyArray_Resize(). This will be done on PyArrayObject whose data >>> memory was allocated by NumPy. So if you do one memory allowcation for >>> metadata and data, just make sure that PyArray_Resize() will handle that >>> correctly. >> >> I'm not sure I follow the details here, but it does turn out that a >> really surprising amount of time in PyArray_NewFromDescr is spent in >> just calculating and writing out the shape and strides buffers, so for >> programs that e.g. use hundreds of small 3-element arrays to represent >> points in space, re-using even these buffers might be a big win... >> >>> On the usefulness of doing only 1 memory allocation, on our old gpu ndarray, >>> we where doing 2 alloc on the GPU, one for metadata and one for data. I >>> removed this, as this was a bottleneck. allocation on the CPU are faster the >>> on the GPU, but this is still something that is slow except if you reuse >>> memory. Do PyMem_Malloc, reuse previous small allocation? >> >> Yes, at least in theory PyMem_Malloc is highly-optimized for small >> buffer re-use. (For requests >256 bytes it just calls malloc().) And >> it's possible to define type-specific freelists; not sure if there's >> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in >> the Python source tree. >> >> -n > > PyMem_Malloc is just a wrapper around malloc, so its only as optimized > as the c library is (glibc is not good for small allocations). > PyObject_Malloc uses a small object allocator for requests smaller 512 > bytes (256 in python2). > > I filed a pull request [0] replacing a few functions which I think are > safe to convert to this API. The nditer allocation which is completely > encapsulated and the construction of the scalar and array python objects > which are deleted via the tp_free slot (we really should not support > third party libraries using PyMem_Free on python objects without checks). > > This already gives up to 15% improvements for scalar operations compared > to glibc 2.17 malloc. > Do I understand the discussions here right that we could replace > PyDimMem_NEW which is used for strides in PyArray with the small object > allocation too? > It would still allow swapping the stride buffer, but every application > must then delete it with PyDimMem_FREE which should be a reasonable > requirement. 
> > [0] https://github.com/numpy/numpy/pull/4177 > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From ndbecker2 at gmail.com Wed Jan 8 14:12:28 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Wed, 08 Jan 2014 14:12:28 -0500 Subject: [Numpy-discussion] an indexing question Message-ID: I have a 1d vector d. I want compute the means of subsets of this vector. The subsets are selected by looking at another vector s or same shape as d. This can be done as: [np.mean (d[s == i]) for i in range (size)] But I think this could be done directly with numpy addressing, without resorting to list comprehension? From jaime.frio at gmail.com Wed Jan 8 14:32:19 2014 From: jaime.frio at gmail.com (=?ISO-8859-1?Q?Jaime_Fern=E1ndez_del_R=EDo?=) Date: Wed, 8 Jan 2014 11:32:19 -0800 Subject: [Numpy-discussion] an indexing question In-Reply-To: References: Message-ID: On Wed, Jan 8, 2014 at 11:12 AM, Neal Becker wrote: > I have a 1d vector d. I want compute the means of subsets of this vector. > The subsets are selected by looking at another vector s or same shape as d. > > This can be done as: > > [np.mean (d[s == i]) for i in range (size)] > > But I think this could be done directly with numpy addressing, without > resorting > to list comprehension? > You could get it done with np.bincount: d_sums = np.bincount(s, weights=d) d_counts = np.bincount(s) d_means = d_sums / d_counts Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Wed Jan 8 15:40:26 2014 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 8 Jan 2014 14:40:26 -0600 Subject: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array In-Reply-To: <52CD9540.3000802@googlemail.com> References: <52CD9540.3000802@googlemail.com> Message-ID: On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor wrote: > On 18.07.2013 15:36, Nathaniel Smith wrote: >> On Wed, Jul 17, 2013 at 5:57 PM, Fr?d?ric Bastien wrote: >>> On the usefulness of doing only 1 memory allocation, on our old gpu ndarray, >>> we where doing 2 alloc on the GPU, one for metadata and one for data. I >>> removed this, as this was a bottleneck. allocation on the CPU are faster the >>> on the GPU, but this is still something that is slow except if you reuse >>> memory. Do PyMem_Malloc, reuse previous small allocation? >> >> Yes, at least in theory PyMem_Malloc is highly-optimized for small >> buffer re-use. (For requests >256 bytes it just calls malloc().) And >> it's possible to define type-specific freelists; not sure if there's >> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in >> the Python source tree. > > PyMem_Malloc is just a wrapper around malloc, so its only as optimized > as the c library is (glibc is not good for small allocations). > PyObject_Malloc uses a small object allocator for requests smaller 512 > bytes (256 in python2). Right, I meant PyObject_Malloc of course. > I filed a pull request [0] replacing a few functions which I think are > safe to convert to this API. The nditer allocation which is completely > encapsulated and the construction of the scalar and array python objects > which are deleted via the tp_free slot (we really should not support > third party libraries using PyMem_Free on python objects without checks). 
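Returning briefly to the indexing question earlier in this digest: a runnable sketch of the np.bincount approach suggested there, assuming the labels in s are integers in the range 0..size-1 and that every label occurs at least once (an empty group would otherwise produce a divide-by-zero in the final step). The random test data is only there to make the snippet self-contained.

    import numpy as np

    rng = np.random.RandomState(0)
    size = 5
    d = rng.rand(1000)                    # data values
    s = rng.randint(0, size, d.shape)     # integer group label for each element

    means_loop = [np.mean(d[s == i]) for i in range(size)]

    sums = np.bincount(s, weights=d)      # per-label sums of d
    counts = np.bincount(s)               # per-label element counts
    means_vec = sums / counts             # group means in a single pass over d

    print(np.allclose(means_loop, means_vec))   # True
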
> > This already gives up to 15% improvements for scalar operations compared > to glibc 2.17 malloc. > Do I understand the discussions here right that we could replace > PyDimMem_NEW which is used for strides in PyArray with the small object > allocation too? > It would still allow swapping the stride buffer, but every application > must then delete it with PyDimMem_FREE which should be a reasonable > requirement. That sounds reasonable to me. If we wanted to get even more elaborate, we could by default stick the shape/strides into the same allocation as the PyArrayObject, and then defer allocating a separate buffer until someone actually calls PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us whether we need to free the shape/stride buffer when deallocating the array.) It's got to be a vanishingly small proportion of arrays where PyArray_Resize is actually called, so for most arrays, this would let us skip the allocation entirely, and the only cost would be that for arrays where PyArray_Resize *is* called to add new dimensions, we'd leave the original buffers sitting around until the array was freed, wasting a tiny amount of memory. Given that no-one has noticed that currently *every* array wastes 50% of this much memory (see upthread), I doubt anyone will care... -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From nouiz at nouiz.org Wed Jan 8 15:44:41 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Wed, 8 Jan 2014 15:44:41 -0500 Subject: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array In-Reply-To: References: <52CD9540.3000802@googlemail.com> Message-ID: On Wed, Jan 8, 2014 at 3:40 PM, Nathaniel Smith wrote: > On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor > wrote: >> On 18.07.2013 15:36, Nathaniel Smith wrote: >>> On Wed, Jul 17, 2013 at 5:57 PM, Fr?d?ric Bastien wrote: >>>> On the usefulness of doing only 1 memory allocation, on our old gpu ndarray, >>>> we where doing 2 alloc on the GPU, one for metadata and one for data. I >>>> removed this, as this was a bottleneck. allocation on the CPU are faster the >>>> on the GPU, but this is still something that is slow except if you reuse >>>> memory. Do PyMem_Malloc, reuse previous small allocation? >>> >>> Yes, at least in theory PyMem_Malloc is highly-optimized for small >>> buffer re-use. (For requests >256 bytes it just calls malloc().) And >>> it's possible to define type-specific freelists; not sure if there's >>> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in >>> the Python source tree. >> >> PyMem_Malloc is just a wrapper around malloc, so its only as optimized >> as the c library is (glibc is not good for small allocations). >> PyObject_Malloc uses a small object allocator for requests smaller 512 >> bytes (256 in python2). > > Right, I meant PyObject_Malloc of course. > >> I filed a pull request [0] replacing a few functions which I think are >> safe to convert to this API. The nditer allocation which is completely >> encapsulated and the construction of the scalar and array python objects >> which are deleted via the tp_free slot (we really should not support >> third party libraries using PyMem_Free on python objects without checks). >> >> This already gives up to 15% improvements for scalar operations compared >> to glibc 2.17 malloc. 
>> Do I understand the discussions here right that we could replace >> PyDimMem_NEW which is used for strides in PyArray with the small object >> allocation too? >> It would still allow swapping the stride buffer, but every application >> must then delete it with PyDimMem_FREE which should be a reasonable >> requirement. > > That sounds reasonable to me. > > If we wanted to get even more elaborate, we could by default stick the > shape/strides into the same allocation as the PyArrayObject, and then > defer allocating a separate buffer until someone actually calls > PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us > whether we need to free the shape/stride buffer when deallocating the > array.) It's got to be a vanishingly small proportion of arrays where > PyArray_Resize is actually called, so for most arrays, this would let > us skip the allocation entirely, and the only cost would be that for > arrays where PyArray_Resize *is* called to add new dimensions, we'd > leave the original buffers sitting around until the array was freed, > wasting a tiny amount of memory. Given that no-one has noticed that > currently *every* array wastes 50% of this much memory (see upthread), > I doubt anyone will care... Seam a good plan. When is it planed to remove the old interface? We can't do it before I think. Fred From jtaylor.debian at googlemail.com Wed Jan 8 16:39:07 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 08 Jan 2014 22:39:07 +0100 Subject: [Numpy-discussion] adding fused multiply and add to numpy Message-ID: <52CDC57B.6010507@googlemail.com> Hi, Since AMDs bulldozer and Intels Haswell x86 cpus now also support the fused-multiply-and-add operation in hardware. http://en.wikipedia.org/wiki/Multiply?accumulate_operation This operation is interesting for numpy for two reasons: - Only one rounding is involved in two arithmetic operations, this is good reducing rounding errors. - Two operations are done in one memory pass, so it improves the performance if ones operations are bound by the memory bandwidth which is very common in numpy. I have done some experiments using a custom ufunc: https://github.com/juliantaylor/npufuncs See the README.md on how to try it out. It requires a recent GCC compiler, at least 4.7 I think. It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software emulation variants. Edit the file to select which one to use. Note if the cpu does not support the instruction it will just crash. Only the latter three are real FMA operations, the SSE2 and AVX variants just perform two regular operations in one loop. My current machine only supports SSE2, so here are the timings for it: In [25]: a = np.arange(500000.) In [26]: b = np.arange(500000.) In [27]: c = np.arange(500000.) In [28]: %timeit npfma.fma(a, b, c) 100 loops, best of 3: 2.49 ms per loop In [30]: def pure_numpy_fma(a,b,c): ....: return a * b + c In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.36 ms per loop In [32]: def pure_numpy_fma2(a,b,c): ....: tmp = a *b ....: tmp += c ....: return tmp In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 3.47 ms per loop As you can see even without real hardware support it is about 30% faster than inplace unblocked numpy due better use of memory bandwidth. Its even more than two times faster than unoptimized numpy. If you have a machine capable of fma instructions give it a spin to see if you get similar or better results. 
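To make the "only one rounding" point above concrete, here is a small pure-Python sketch comparing the two-step a * b + c against the exactly rounded result, which is what a hardware fma would return. The inputs are contrived so that the intermediate product rounds away the entire answer; fractions is used only to compute the exact value.

    from fractions import Fraction

    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0

    two_step = a * b + c                          # a*b rounds to 1.0 first, so this is 0.0
    exact = Fraction(a) * Fraction(b) + Fraction(c)

    print(two_step)                               # 0.0
    print(float(exact))                           # about -8.67e-19, the single-rounding (fused) result
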
Please verify the assembly (objdump -d fma-.o) to check if the compiler properly used the machine fma. An issue is software emulation of real fma. This can be enabled in the test ufunc with npfma.set_type("libc"). This is unfortunately incredibly slow about a factor 300 on my machine without hardware fma. This means we either have a function that is fast on some platforms and slow on others but always gives the same result or we have a fast function that gives better results on some platforms. Given that we are not worth that what numpy currently provides I favor the latter. Any opinions on whether this should go into numpy or maybe stay a third party ufunc? Concerning the interface one should probably add several variants mirroring the FMA3 instruction set: http://en.wikipedia.org/wiki/Multiply?accumulate_operation additionally there is fmaddsub (a0 * b0 + c0, a1 *b1 - c1) which can be used for complex numbers, but they probably don't need an explicit numpy interface. From charlesr.harris at gmail.com Wed Jan 8 17:09:58 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 8 Jan 2014 15:09:58 -0700 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: <52CDC57B.6010507@googlemail.com> References: <52CDC57B.6010507@googlemail.com> Message-ID: On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor wrote: > Hi, > Since AMDs bulldozer and Intels Haswell x86 cpus now also support the > fused-multiply-and-add operation in hardware. > > http://en.wikipedia.org/wiki/Multiply?accumulate_operation > > This operation is interesting for numpy for two reasons: > - Only one rounding is involved in two arithmetic operations, this is > good reducing rounding errors. > - Two operations are done in one memory pass, so it improves the > performance if ones operations are bound by the memory bandwidth which > is very common in numpy. > > I have done some experiments using a custom ufunc: > https://github.com/juliantaylor/npufuncs > > See the README.md on how to try it out. It requires a recent GCC > compiler, at least 4.7 I think. > > It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software emulation > variants. Edit the file to select which one to use. Note if the cpu does > not support the instruction it will just crash. > Only the latter three are real FMA operations, the SSE2 and AVX variants > just perform two regular operations in one loop. > > My current machine only supports SSE2, so here are the timings for it: > > In [25]: a = np.arange(500000.) > In [26]: b = np.arange(500000.) > In [27]: c = np.arange(500000.) > > In [28]: %timeit npfma.fma(a, b, c) > 100 loops, best of 3: 2.49 ms per loop > > In [30]: def pure_numpy_fma(a,b,c): > ....: return a * b + c > > In [31]: %timeit pure_numpy_fma(a, b, c) > 100 loops, best of 3: 7.36 ms per loop > > > In [32]: def pure_numpy_fma2(a,b,c): > ....: tmp = a *b > ....: tmp += c > ....: return tmp > > In [33]: %timeit pure_numpy_fma2(a, b, c) > 100 loops, best of 3: 3.47 ms per loop > > > As you can see even without real hardware support it is about 30% faster > than inplace unblocked numpy due better use of memory bandwidth. Its > even more than two times faster than unoptimized numpy. > > If you have a machine capable of fma instructions give it a spin to see > if you get similar or better results. Please verify the assembly > (objdump -d fma-.o) to check if the compiler properly used the > machine fma. > > An issue is software emulation of real fma. This can be enabled in the > test ufunc with npfma.set_type("libc"). 
> This is unfortunately incredibly slow about a factor 300 on my machine > without hardware fma. > This means we either have a function that is fast on some platforms and > slow on others but always gives the same result or we have a fast > function that gives better results on some platforms. > Given that we are not worth that what numpy currently provides I favor > the latter. > > Any opinions on whether this should go into numpy or maybe stay a third > party ufunc? > Multiply and add is a standard function that I think would be good to have in numpy. Not only does it save on memory accesses, it saves on temporary arrays. Another function that could be useful is a |a|**2 function, abs2 perhaps. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Jan 9 09:35:22 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 09 Jan 2014 09:35:22 -0500 Subject: [Numpy-discussion] adding fused multiply and add to numpy References: <52CDC57B.6010507@googlemail.com> Message-ID: Charles R Harris wrote: > On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor > wrote: > ... > > Another function that could be useful is a |a|**2 function, abs2 perhaps. > > Chuck I use mag_sqr all the time. It should be much faster for complex, if computed via: x.real**2 + x.imag**2 avoiding the sqrt of abs. From freddie at witherden.org Thu Jan 9 09:43:07 2014 From: freddie at witherden.org (Freddie Witherden) Date: Thu, 09 Jan 2014 14:43:07 +0000 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: <52CDC57B.6010507@googlemail.com> References: <52CDC57B.6010507@googlemail.com> Message-ID: <52CEB57B.1090504@witherden.org> On 08/01/14 21:39, Julian Taylor wrote: > An issue is software emulation of real fma. This can be enabled in the > test ufunc with npfma.set_type("libc"). > This is unfortunately incredibly slow about a factor 300 on my machine > without hardware fma. > This means we either have a function that is fast on some platforms and > slow on others but always gives the same result or we have a fast > function that gives better results on some platforms. > Given that we are not worth that what numpy currently provides I favor > the latter. > > Any opinions on whether this should go into numpy or maybe stay a third > party ufunc? My preference would be to initially add an "madd" intrinsic. This can be supported on all platforms and can be documented to permit the use of FMA where available. A 'true' FMA intrinsic function should only be provided when hardware FMA support is available. Many of the more interesting applications of FMA depend on there only being a single rounding step and as such "FMA" should probably mean "a*b + c with only a single rounding". Regards, Freddie. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From nouiz at nouiz.org Thu Jan 9 09:50:55 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Thu, 9 Jan 2014 09:50:55 -0500 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: <52CEB57B.1090504@witherden.org> References: <52CDC57B.6010507@googlemail.com> <52CEB57B.1090504@witherden.org> Message-ID: Hi, It happen frequently that NumPy isn't compiled with all instruction that is available where it run. For example in distro. 
So if the decision is made to use the fast version when we don't use the newer instruction, the user need a way to know that. So the library need a function/attribute to tell that. How hard would it be to provide the choise to the user? We could provide 2 functions like: fma_fast() fma_prec() (for precision)? Or this could be a parameter or a user configuration option like for the overflow/underflow error. Fred On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden wrote: > On 08/01/14 21:39, Julian Taylor wrote: >> An issue is software emulation of real fma. This can be enabled in the >> test ufunc with npfma.set_type("libc"). >> This is unfortunately incredibly slow about a factor 300 on my machine >> without hardware fma. >> This means we either have a function that is fast on some platforms and >> slow on others but always gives the same result or we have a fast >> function that gives better results on some platforms. >> Given that we are not worth that what numpy currently provides I favor >> the latter. >> >> Any opinions on whether this should go into numpy or maybe stay a third >> party ufunc? > > My preference would be to initially add an "madd" intrinsic. This can > be supported on all platforms and can be documented to permit the use of > FMA where available. > > A 'true' FMA intrinsic function should only be provided when hardware > FMA support is available. Many of the more interesting applications of > FMA depend on there only being a single rounding step and as such "FMA" > should probably mean "a*b + c with only a single rounding". > > Regards, Freddie. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From davidmenhur at gmail.com Thu Jan 9 09:54:42 2014 From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=) Date: Thu, 9 Jan 2014 15:54:42 +0100 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: <52CDC57B.6010507@googlemail.com> References: <52CDC57B.6010507@googlemail.com> Message-ID: On 8 January 2014 22:39, Julian Taylor wrote: > As you can see even without real hardware support it is about 30% faster > than inplace unblocked numpy due better use of memory bandwidth. Its > even more than two times faster than unoptimized numpy. > I have an i5, and AVX crashes, even though it is supported by my CPU. Here are my timings: SSE2: In [24]: %timeit npfma.fma(a, b, c) 100000 loops, best of 3: 15 us per loop In [28]: %timeit npfma.fma(a, b, c) 100 loops, best of 3: 2.36 ms per loop In [29]: %timeit npfma.fms(a, b, c) 100 loops, best of 3: 2.36 ms per loop In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.5 ms per loop In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 4.41 ms per loop The model supports all the way to sse4_2 libc: In [24]: %timeit npfma.fma(a, b, c) 1000 loops, best of 3: 883 us per loop In [28]: %timeit npfma.fma(a, b, c) 10 loops, best of 3: 88.7 ms per loop In [29]: %timeit npfma.fms(a, b, c) 10 loops, best of 3: 87.4 ms per loop In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.94 ms per loop In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 3.03 ms per loop > If you have a machine capable of fma instructions give it a spin to see > if you get similar or better results. Please verify the assembly > (objdump -d fma-.o) to check if the compiler properly used the > machine fma. 
> Following the instructions in the readme, there is only one compiled file, npfma.so, but no .o. /David. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Jan 9 10:18:03 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 9 Jan 2014 16:18:03 +0100 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: References: <52CDC57B.6010507@googlemail.com> Message-ID: On Thu, Jan 9, 2014 at 3:54 PM, Da?id wrote: > > On 8 January 2014 22:39, Julian Taylor wrote: > >> As you can see even without real hardware support it is about 30% faster >> than inplace unblocked numpy due better use of memory bandwidth. Its >> even more than two times faster than unoptimized numpy. >> > > I have an i5, and AVX crashes, even though it is supported by my CPU. > I forgot about the 32 byte alignment avx (as it is used in this code) requires. I pushed a new version that takes care of it. It should now work with avx. > Following the instructions in the readme, there is only one compiled file, > npfma.so, but no .o. > > > the .o files are in the build/ subfolder -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Jan 9 10:30:14 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 9 Jan 2014 16:30:14 +0100 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: References: <52CDC57B.6010507@googlemail.com> <52CEB57B.1090504@witherden.org> Message-ID: On Thu, Jan 9, 2014 at 3:50 PM, Fr?d?ric Bastien wrote: > Hi, > > It happen frequently that NumPy isn't compiled with all instruction > that is available where it run. For example in distro. So if the > decision is made to use the fast version when we don't use the newer > instruction, the user need a way to know that. So the library need a > function/attribute to tell that. > As these instructions are very new runtime cpu feature detection is required. That way also distribution users get the fast code if their cpu supports it. > > How hard would it be to provide the choise to the user? We could > provide 2 functions like: fma_fast() fma_prec() (for precision)? Or > this could be a parameter or a user configuration option like for the > overflow/underflow error. > I like Freddie Witherden proposal to name the function madd which does not guarantee one rounding operation. This leaves the namespace open for a special fma function with that guarantee. It can use the libc fma function which is very slow sometimes but platform independent. This is assuming apple did not again take shortcuts like they did with their libc hypot implementation, can someone disassemble apple libc to check what they are doing for C99 fma? And it leaves users the possibility to use the faster madd function if they do not need the precision guarantee. Another option would be a precision context manager which tells numpy which variant to use. This would also be useful for other code (like abs/hypot/abs2/sum/reciprocal sqrt) but probably it involves more work. with numpy.precision_mode('fast'): ... # allow no fma, use fast hypot, fast sum, ignore overflow/invalid errors with numpy.precision_mode('precise'): ... 
# require fma, use precise hypot, use exact summation (math.fsum) or at least kahan summation, full overflow/invalid checks etc > > > On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden > wrote: > > On 08/01/14 21:39, Julian Taylor wrote: > >> An issue is software emulation of real fma. This can be enabled in the > >> test ufunc with npfma.set_type("libc"). > >> This is unfortunately incredibly slow about a factor 300 on my machine > >> without hardware fma. > >> This means we either have a function that is fast on some platforms and > >> slow on others but always gives the same result or we have a fast > >> function that gives better results on some platforms. > >> Given that we are not worth that what numpy currently provides I favor > >> the latter. > >> > >> Any opinions on whether this should go into numpy or maybe stay a third > >> party ufunc? > > > > My preference would be to initially add an "madd" intrinsic. This can > > be supported on all platforms and can be documented to permit the use of > > FMA where available. > > > > A 'true' FMA intrinsic function should only be provided when hardware > > FMA support is available. Many of the more interesting applications of > > FMA depend on there only being a single rounding step and as such "FMA" > > should probably mean "a*b + c with only a single rounding". > > > > Regards, Freddie. > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Thu Jan 9 12:07:00 2014 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 9 Jan 2014 17:07:00 +0000 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: References: <52CDC57B.6010507@googlemail.com> <52CEB57B.1090504@witherden.org> Message-ID: On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor wrote: > On Thu, Jan 9, 2014 at 3:50 PM, Fr?d?ric Bastien wrote: >> How hard would it be to provide the choise to the user? We could >> provide 2 functions like: fma_fast() fma_prec() (for precision)? Or >> this could be a parameter or a user configuration option like for the >> overflow/underflow error. > > I like Freddie Witherden proposal to name the function madd which does not > guarantee one rounding operation. This leaves the namespace open for a > special fma function with that guarantee. It can use the libc fma function > which is very slow sometimes but platform independent. This is assuming > apple did not again take shortcuts like they did with their libc hypot > implementation, can someone disassemble apple libc to check what they are > doing for C99 fma? > And it leaves users the possibility to use the faster madd function if they > do not need the precision guarantee. If madd doesn't provide any rounding guarantees, then its only reason for existence is that it provides a fused a*b+c loop that better utilizes memory bandwidth, right? I'm guessing that speed-wise it doesn't really matter whether you use the fancy AVX instructions or not, since even the naive implementation is memory bound -- the advantage is just in the fusion? Lack of loop fusion is obviously a major limitation of numpy, but it's a very general problem. 
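As an aside on the loop-fusion point here (and the numexpr suggestion that follows below): the third-party numexpr package already provides fused, blocked evaluation of expressions like a * b + c without adding a dedicated ufunc, though it gives no single-rounding guarantee. A minimal sketch, assuming numexpr is installed:

    import numpy as np
    import numexpr as ne            # third-party package, not part of numpy

    a = np.arange(500000.)
    b = np.arange(500000.)
    c = np.arange(500000.)

    fused = ne.evaluate("a * b + c")        # evaluated in cache-sized blocks, one pass over the inputs
    print(np.allclose(fused, a * b + c))    # True
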
I'm sceptical about whether we want to get into the business of adding functions whose only purpose is to provide pre-fused loops. After madd, what other operations should we provide like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)? How do we decide? Surely it's better to direct people who are hitting memory bottlenecks to much more powerful and general solutions to this problem, like numexpr/cython/numba/theano? (OTOH the verison that gives rounding guarantees is obviously a unique new feature.) -n From jaime.frio at gmail.com Thu Jan 9 13:32:29 2014 From: jaime.frio at gmail.com (=?ISO-8859-1?Q?Jaime_Fern=E1ndez_del_R=EDo?=) Date: Thu, 9 Jan 2014 10:32:29 -0800 Subject: [Numpy-discussion] ENH: add a 'return_counts=' keyword argument to `np.unique` Message-ID: Hi, I have just sent a PR, adding a `return_counts` keyword argument to `np.unique` that does exactly what the name suggests: counting the number of times each unique time comes up in the array. It reuses the `flag` array that is constructed whenever any optional index is requested, extracts the indices of the `True`s in it, and returns their diff. You can check it here: https://github.com/numpy/numpy/pull/4180 Regards, Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Jan 9 18:21:17 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 9 Jan 2014 16:21:17 -0700 Subject: [Numpy-discussion] Memory allocation cleanup Message-ID: Apropos Julian's changes to use the PyObject_* allocation suite for some parts of numpy, I posted the following I think numpy memory management is due a cleanup. Currently we have PyDataMem_* PyDimMem_* PyArray_* Plus the malloc, PyMem_*, and PyObject_* interfaces. That is six ways to manage heap allocations. As far as I can tell, PyArray_* is always PyMem_*in practice. We probably need to keep the PyDataMem family as it has a memory tracking option, but PyDimMem just confuses things, I'd rather just use PyMem_* with explicit size. Curiously, the PyObject_Malloc family is not documented apart from some release notes. We should also check for the macro versions of PyMem_* as they are deprecated for extension modules. Nathaniel then suggested that we consider going all Python allocators, especially as new memory tracing tools are coming online in 3.4. Given that these changes could have some impact on current extension writers I thought I'd bring this up on the list to gather opinions. Thoughts? -------------- next part -------------- An HTML attachment was scrubbed... URL: From nouiz at nouiz.org Thu Jan 9 19:35:32 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Thu, 9 Jan 2014 19:35:32 -0500 Subject: [Numpy-discussion] Memory allocation cleanup In-Reply-To: References: Message-ID: This shouldn't affect Theano. So I have no objection. Making thing faster and more tracktable is always good. So I think it seam a good idea. Fred On Thu, Jan 9, 2014 at 6:21 PM, Charles R Harris wrote: > Apropos Julian's changes to use the PyObject_* allocation suite for some > parts of numpy, I posted the following > > I think numpy memory management is due a cleanup. Currently we have > > PyDataMem_* > PyDimMem_* > PyArray_* > > Plus the malloc, PyMem_*, and PyObject_* interfaces. That is six ways to > manage heap allocations. 
As far as I can tell, PyArray_* is always PyMem_* > in practice. We probably need to keep the PyDataMem family as it has a > memory tracking option, but PyDimMem just confuses things, I'd rather just > use PyMem_* with explicit size. Curiously, the PyObject_Malloc family is not > documented apart from some release notes. > > We should also check for the macro versions of PyMem_* as they are > deprecated for extension modules. > > Nathaniel then suggested that we consider going all Python allocators, > especially as new memory tracing tools are coming online in 3.4. Given that > these changes could have some impact on current extension writers I thought > I'd bring this up on the list to gather opinions. > > Thoughts? > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From nouiz at nouiz.org Thu Jan 9 19:49:00 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Thu, 9 Jan 2014 19:49:00 -0500 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: References: <52CDC57B.6010507@googlemail.com> <52CEB57B.1090504@witherden.org> Message-ID: Good questions where do we stop. I think as you that the fma with guarantees is a good new feature. But if this is made available, people will want to use it for speed. Some people won't like to use another library or dependency. They won't like to have random speed up or slow down. So why not add the ma and fma and trace the line to the operation implemented on the CPU that have an fused version? That will make a sensible limit I think. Anyway, we won't use it directly. This is just my taught. Do you know if those instruction are automatically used by gcc if we use the good architecture parameter? Fred On Thu, Jan 9, 2014 at 12:07 PM, Nathaniel Smith wrote: > On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor > wrote: >> On Thu, Jan 9, 2014 at 3:50 PM, Fr?d?ric Bastien wrote: >>> How hard would it be to provide the choise to the user? We could >>> provide 2 functions like: fma_fast() fma_prec() (for precision)? Or >>> this could be a parameter or a user configuration option like for the >>> overflow/underflow error. >> >> I like Freddie Witherden proposal to name the function madd which does not >> guarantee one rounding operation. This leaves the namespace open for a >> special fma function with that guarantee. It can use the libc fma function >> which is very slow sometimes but platform independent. This is assuming >> apple did not again take shortcuts like they did with their libc hypot >> implementation, can someone disassemble apple libc to check what they are >> doing for C99 fma? >> And it leaves users the possibility to use the faster madd function if they >> do not need the precision guarantee. > > If madd doesn't provide any rounding guarantees, then its only reason > for existence is that it provides a fused a*b+c loop that better > utilizes memory bandwidth, right? I'm guessing that speed-wise it > doesn't really matter whether you use the fancy AVX instructions or > not, since even the naive implementation is memory bound -- the > advantage is just in the fusion? > > Lack of loop fusion is obviously a major limitation of numpy, but it's > a very general problem. I'm sceptical about whether we want to get > into the business of adding functions whose only purpose is to provide > pre-fused loops. After madd, what other operations should we provide > like this? msub (a*b-c)? 
add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)? > How do we decide? Surely it's better to direct people who are hitting > memory bottlenecks to much more powerful and general solutions to this > problem, like numexpr/cython/numba/theano? > > (OTOH the verison that gives rounding guarantees is obviously a unique > new feature.) > > -n > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jtaylor.debian at googlemail.com Thu Jan 9 20:06:01 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 10 Jan 2014 02:06:01 +0100 Subject: [Numpy-discussion] adding fused multiply and add to numpy In-Reply-To: References: <52CDC57B.6010507@googlemail.com> <52CEB57B.1090504@witherden.org> Message-ID: <52CF4779.3050403@googlemail.com> On 10.01.2014 01:49, Fr?d?ric Bastien wrote: > > Do you know if those instruction are automatically used by gcc if we > use the good architecture parameter? > > they are used if you enable -ffp-contract=fast. Do not set it to `on` this is an alias to `off` due to the semantics of C. -ffast-math enables in in gcc 4.7 and 4.8 but not in 4.9 but this might be a bug, I filed one a while ago. Also you need to set the -mfma or -arch=bdver{1,2,3,4}. Its not part of -mavx2 last I checked. But there are not many places in numpy the compiler can use it, only dot comes to mind which goes over blas libraries in the high performance case. From njs at pobox.com Thu Jan 9 21:48:25 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 10 Jan 2014 02:48:25 +0000 Subject: [Numpy-discussion] Memory allocation cleanup In-Reply-To: References: Message-ID: On Thu, Jan 9, 2014 at 11:21 PM, Charles R Harris wrote: > Apropos Julian's changes to use the PyObject_* allocation suite for some > parts of numpy, I posted the following > > I think numpy memory management is due a cleanup. Currently we have > > PyDataMem_* > PyDimMem_* > PyArray_* > > Plus the malloc, PyMem_*, and PyObject_* interfaces. That is six ways to > manage heap allocations. As far as I can tell, PyArray_* is always PyMem_* > in practice. We probably need to keep the PyDataMem family as it has a > memory tracking option, but PyDimMem just confuses things, I'd rather just > use PyMem_* with explicit size. Curiously, the PyObject_Malloc family is not > documented apart from some release notes. > > We should also check for the macro versions of PyMem_* as they are > deprecated for extension modules. > > Nathaniel then suggested that we consider going all Python allocators, > especially as new memory tracing tools are coming online in 3.4. Given that > these changes could have some impact on current extension writers I thought > I'd bring this up on the list to gather opinions. > > Thoughts? After a bit more research, some further points to keep in mind: Currently, PyDimMem_* and PyArray_* are just aliases for malloc/free, and PyDataMem_* is an alias for malloc/free with some extra tracing hooks wrapped around it. (AFAIK, these tracing hooks are not used by anyone anywhere -- at least, if they are I haven't heard about it, and there is no code on github that uses them.) There is one substantial difference between the PyMem_* and PyObject_* interfaces as compared to malloc(), which is that the Py* interfaces require that the GIL be held when they are called. (@Julian -- I think your PR we just merged fulfills this requirement, is that right?) 
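For readers who have not used it, a short sketch of what the Python 3.4 tracing machinery referenced below looks like from the Python side. Whether the 8 MB data buffer actually shows up in the snapshot depends on whether numpy routes its data allocations through a traced allocator, which is exactly the open question in this thread.

    # Requires Python >= 3.4 (or run the interpreter with -X tracemalloc)
    import tracemalloc
    import numpy as np

    tracemalloc.start()
    a = np.ones(10 ** 6)                         # ~8 MB data buffer
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:5]:
        print(stat)                              # the 8 MB allocation is listed only if the
                                                 # buffer went through a traced allocator
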
I strongly suspect that we have PyDataMem_* calls outside of the GIL -- e.g., when allocating ufunc buffers -- and third-party code might as well. Python 3.4's new memory allocation API and tracing stuff is documented here: http://www.python.org/dev/peps/pep-0445/ http://docs.python.org/dev/c-api/memory.html http://docs.python.org/dev/library/tracemalloc.html In particular, 3.4 adds a set of PyRawMem_* functions, which do not require the GIL. Checking the current source code for _tracemalloc.c, it appears that PyRawMem_* functions *are* traced, so that's nice - that means that switching PyDataMem_* to use PyRawMem_* would be both safe and provide benefits. However, PyRawMem_* does not provide the pymalloc optimizations for small allocations. Also, none of the Py* interfaces implement calloc(), which is annoying because it messes up our new optimization of using calloc() for np.zeros. (calloc() is generally faster than malloc()+explicit zeroing, because it can use OS-specific virtual memory tricks to zero out the memory "for free". These same tricks also mean that if you use np.zeros() to allocate a large array, and then only write to a few entries in that array, the total memory used is proportional to the number of non-zero entries, rather than to the actual size of the array, which can be extremely useful in some situations as a kind of "poor man's sparse array".) I'm pretty sure that the vast majority of our allocations do occur with GIL protection, so we might want to switch to using PyObject_* for most cases to take advantage of the small-object optimizations, and use PyRawMem_* for any non-GIL cases (like possibly ufunc buffers), with a compatibility wrapper to replace PyRawMem_* with malloc() on pre-3.4 pythons. Of course this will need some profiling to see if PyObject_* is actually better than malloc() in practice. For calloc(), we could try and convince python-dev to add this, or np.zeros() could explicitly use calloc() even when other code uses Py* interface and then uses an ndarray flag or special .base object to keep track of the fact that we need to use free() to deallocate this memory, or we could give up on the calloc optimization. -n From jtaylor.debian at googlemail.com Fri Jan 10 04:18:05 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 10 Jan 2014 10:18:05 +0100 Subject: [Numpy-discussion] Memory allocation cleanup In-Reply-To: References: Message-ID: On Fri, Jan 10, 2014 at 3:48 AM, Nathaniel Smith wrote: > On Thu, Jan 9, 2014 at 11:21 PM, Charles R Harris > wrote: > > [...] > > After a bit more research, some further points to keep in mind: > > Currently, PyDimMem_* and PyArray_* are just aliases for malloc/free, > and PyDataMem_* is an alias for malloc/free with some extra tracing > hooks wrapped around it. (AFAIK, these tracing hooks are not used by > anyone anywhere -- at least, if they are I haven't heard about it, and > there is no code on github that uses them.) > There is one substantial difference between the PyMem_* and PyObject_* > interfaces as compared to malloc(), which is that the Py* interfaces > require that the GIL be held when they are called. (@Julian -- I think > your PR we just merged fulfills this requirement, is that right?) I only replaced object allocation which should always be called under GIL, not sure about nditer construction, but it does uses python exceptions for errors which I think also require the GIL. [...] 
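For readers unfamiliar with the calloc() behaviour described above, a small demonstration. It assumes a 64-bit POSIX system (the resource module is not available on Windows, and ru_maxrss is reported in kilobytes on Linux) and that np.zeros obtains its buffer from calloc() as discussed here; the peak-RSS counter only grows, which is why simple differences are meaningful below. Exact numbers will vary by machine.

    import numpy as np
    import resource                       # POSIX only

    def peak_rss_kb():
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    base = peak_rss_kb()
    z = np.zeros((8192, 8192))            # 512 MB of address space, obtained via calloc
    print(peak_rss_kb() - base)           # small: untouched zero pages are not resident yet

    z[::64, ::64] = 1.0                   # write to a sparse subset of the pages
    print(peak_rss_kb() - base)           # grows only in proportion to the pages touched

    z += 1.0                              # now every page is written
    print(peak_rss_kb() - base)           # roughly the full 512 MB becomes resident
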
> > Also, none of the Py* interfaces implement calloc(), which is annoying > because it messes up our new optimization of using calloc() for > np.zeros. [...] > Another thing that is not directly implemented in Python is aligned allocation. This is going to get increasingly important with the advent heavily vectorized x86 CPUs (e.g. AVX512 is rolling out now) and the C malloc being optimized for the oldish SSE (16 bytes). I want to change the array buffer allocation to make use of posix_memalign and C11 aligned_malloc if available to avoid some penalties when loading from non 32 byte aligned buffers. I could imagine it might also help coprocessors and gpus to have higher alignments, but I'm not very familiar with that type of hardware. The allocator used by the Python3.4 is plugable, so we could implement our special allocators with the new API, but only when 3.4 is more widespread. For this reason and missing calloc I don't think we should use the Python API for data buffers just yet. Any benefits are relatively small anyway. [...] > > I'm pretty sure that the vast majority of our allocations do occur > with GIL protection, so we might want to switch to using PyObject_* > for most cases to take advantage of the small-object optimizations, > and use PyRawMem_* for any non-GIL cases (like possibly ufunc > buffers), with a compatibility wrapper to replace PyRawMem_* with > malloc() on pre-3.4 pythons. Of course this will need some profiling > to see if PyObject_* is actually better than malloc() in practice. I don't think its required to replace everything with PyObject_* just because it can be faster. We should do it only in places where it really makes a difference and there are not that many of them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From nouiz at nouiz.org Fri Jan 10 09:52:23 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Fri, 10 Jan 2014 09:52:23 -0500 Subject: [Numpy-discussion] Memory allocation cleanup In-Reply-To: References: Message-ID: On Fri, Jan 10, 2014 at 4:18 AM, Julian Taylor wrote: > On Fri, Jan 10, 2014 at 3:48 AM, Nathaniel Smith wrote: >> >> On Thu, Jan 9, 2014 at 11:21 PM, Charles R Harris >> wrote: >> > [...] >> >> After a bit more research, some further points to keep in mind: >> >> Currently, PyDimMem_* and PyArray_* are just aliases for malloc/free, >> and PyDataMem_* is an alias for malloc/free with some extra tracing >> hooks wrapped around it. (AFAIK, these tracing hooks are not used by >> anyone anywhere -- at least, if they are I haven't heard about it, and >> there is no code on github that uses them.) >> >> >> There is one substantial difference between the PyMem_* and PyObject_* >> interfaces as compared to malloc(), which is that the Py* interfaces >> require that the GIL be held when they are called. (@Julian -- I think >> your PR we just merged fulfills this requirement, is that right?) > > > I only replaced object allocation which should always be called under GIL, > not sure about nditer construction, but it does uses python exceptions for > errors which I think also require the GIL. > > [...] >> >> >> Also, none of the Py* interfaces implement calloc(), which is annoying >> because it messes up our new optimization of using calloc() for >> np.zeros. [...] > > > Another thing that is not directly implemented in Python is aligned > allocation. This is going to get increasingly important with the advent > heavily vectorized x86 CPUs (e.g. 
AVX512 is rolling out now) and the C > malloc being optimized for the oldish SSE (16 bytes). I want to change the > array buffer allocation to make use of posix_memalign and C11 aligned_malloc > if available to avoid some penalties when loading from non 32 byte aligned > buffers. I could imagine it might also help coprocessors and gpus to have > higher alignments, but I'm not very familiar with that type of hardware. > The allocator used by the Python3.4 is plugable, so we could implement our > special allocators with the new API, but only when 3.4 is more widespread. About the co-processor and GPUs, it could help, but as NumPy is CPU only and that there is other problem in directly using it, I dought that this change would help code around co-processor/GPUs. Fred From njs at pobox.com Fri Jan 10 11:03:11 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 10 Jan 2014 16:03:11 +0000 Subject: [Numpy-discussion] Memory allocation cleanup In-Reply-To: References: Message-ID: On Fri, Jan 10, 2014 at 9:18 AM, Julian Taylor wrote: > On Fri, Jan 10, 2014 at 3:48 AM, Nathaniel Smith wrote: >> >> Also, none of the Py* interfaces implement calloc(), which is annoying >> because it messes up our new optimization of using calloc() for >> np.zeros. [...] > > > Another thing that is not directly implemented in Python is aligned > allocation. This is going to get increasingly important with the advent > heavily vectorized x86 CPUs (e.g. AVX512 is rolling out now) and the C > malloc being optimized for the oldish SSE (16 bytes). I want to change the > array buffer allocation to make use of posix_memalign and C11 aligned_malloc > if available to avoid some penalties when loading from non 32 byte aligned > buffers. I could imagine it might also help coprocessors and gpus to have > higher alignments, but I'm not very familiar with that type of hardware. > The allocator used by the Python3.4 is plugable, so we could implement our > special allocators with the new API, but only when 3.4 is more widespread. > > For this reason and missing calloc I don't think we should use the Python > API for data buffers just yet. Any benefits are relatively small anyway. It really would be nice if our data allocations would all be visible to the tracemalloc library though, somehow. And I doubt we want to patch *all* Python allocations to go through posix_memalign, both because this is rather intrusive and because it would break python -X tracemalloc. How certain are we that we want to switch to aligned allocators in the future? If we don't, then maybe it makes to ask python-dev for a calloc interface; but if we do, then I doubt we can convince them to add aligned allocation interfaces, and we'll need to ask for something else (maybe a "null" allocator, which just notifies the python memory tracking machinery that we allocated something ourselves?). It's not obvious to me why aligning data buffers is useful - can you elaborate? There's no code simplification, because we always have to handle the unaligned case anyway with the standard unaligned startup/cleanup loops. And intuitively, given the existence of such loops, alignment shouldn't matter much in practice, since the most that shifting alignment can do is change the number of elements that need to be handled by such loops by (SIMD alignment value / element size). For doubles, in a buffer that has 16 byte alignment but not 32 byte alignment, this means that worst case, we end up doing 4 unnecessary non-SIMD operations. 
And surely that only matters for very small arrays (for large arrays such constant overhead will amortize out), but for small arrays SIMD doesn't help much anyway? Probably I'm missing something, because you actually know something about SIMD and I'm just hand-waving from first principles :-). But it'd be nice to understand the reasoning for why/whether alignment really helps in the numpy context. -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From lists at hilboll.de Fri Jan 10 11:03:33 2014 From: lists at hilboll.de (Andreas Hilboll) Date: Fri, 10 Jan 2014 17:03:33 +0100 Subject: [Numpy-discussion] Why do weights in np.polyfit have to be 1D? Message-ID: <52D019D5.4020902@hilboll.de> Hi, in using np.polyfit (in version 1.7.1), I ran accross TypeError: expected a 1-d array for weights when trying to fit k polynomials at once (x.shape = (4, ), y.shape = (4, 136), w.shape = (4, 136)). Is there any specific reason why this is not supported? -- Andreas. From charlesr.harris at gmail.com Fri Jan 10 12:02:01 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 10 Jan 2014 10:02:01 -0700 Subject: [Numpy-discussion] Why do weights in np.polyfit have to be 1D? In-Reply-To: <52D019D5.4020902@hilboll.de> References: <52D019D5.4020902@hilboll.de> Message-ID: On Fri, Jan 10, 2014 at 9:03 AM, Andreas Hilboll wrote: > Hi, > > in using np.polyfit (in version 1.7.1), I ran accross > > TypeError: expected a 1-d array for weights > > when trying to fit k polynomials at once (x.shape = (4, ), y.shape = (4, > 136), w.shape = (4, 136)). Is there any specific reason why this is not > supported? > The weights are applied to the rows of the design matrix, so if you have multiple weight vectors you essentially need to iterate the fit over them. Said differently, for each weight vector there is a generalized inverse and if there is a different weight vector for each column of the rhs, then there is a different generalized inverse for each column. You can't just multiply the rhs from the left by *the* inverse. The problem doesn't vectorize. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nicolas.Rougier at inria.fr Fri Jan 10 12:37:38 2014 From: Nicolas.Rougier at inria.fr (Nicolas Rougier) Date: Fri, 10 Jan 2014 18:37:38 +0100 Subject: [Numpy-discussion] Bug in resize of structured array (with initial size = 0) Message-ID: <722576BC-22A0-418F-A039-65F44B835784@inria.fr> Hi, I've tried to resize a record array that was first empty (on purpose, I need it) and I got the following error (while it's working for regular array). Traceback (most recent call last): File "test_resize.py", line 10, in print np.resize(V,2) File "/usr/locaL/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 1053, in resize if not Na: return mu.zeros(new_shape, a.dtype.char) TypeError: Empty data-type I'm using numpy 1.8.0, python 2.7.6, osx 10.9.1. Can anyone confirm before I submit an issue ? 
Here is the script: V = np.zeros(0, dtype=np.float32) print V.dtype print np.resize(V,2) V = np.zeros(0, dtype=[('a', np.float32, 1)]) print V.dtype print np.resize(V,2) From jtaylor.debian at googlemail.com Fri Jan 10 14:15:26 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 10 Jan 2014 20:15:26 +0100 Subject: [Numpy-discussion] Memory allocation cleanup In-Reply-To: References: Message-ID: <52D046CE.4050409@googlemail.com> On 10.01.2014 17:03, Nathaniel Smith wrote: > On Fri, Jan 10, 2014 at 9:18 AM, Julian Taylor > wrote: >> On Fri, Jan 10, 2014 at 3:48 AM, Nathaniel Smith wrote: >>> [...] >> >> For this reason and missing calloc I don't think we should use the Python >> API for data buffers just yet. Any benefits are relatively small anyway. > > It really would be nice if our data allocations would all be visible > to the tracemalloc library though, somehow. And I doubt we want to > patch *all* Python allocations to go through posix_memalign, both > because this is rather intrusive and because it would break python -X > tracemalloc. we can most likely plug aligned allocators into the python allocator to still be able to use tracemalloc but it would be python3.4 only [0], older versions would continue to use our aligned allocators directly with our own tracing. I think thats fine, I doubt the tracemalloc module will be backported to older pythons. An issue is we can't fit calloc in there without abusing one of the domains, but I think it is also not so critical to keep it. The sparseness is neat but you can lose it very quickly again too (basically on any full copy) and its not portable. > > How certain are we that we want to switch to aligned allocators in the > future? If we don't, then maybe it makes to ask python-dev for a > calloc interface; but if we do, then I doubt we can convince them to > add aligned allocation interfaces, and we'll need to ask for something > else (maybe a "null" allocator, which just notifies the python memory > tracking machinery that we allocated something ourselves?). > > It's not obvious to me why aligning data buffers is useful - can you > elaborate? There's no code simplification, because we always have to > handle the unaligned case anyway with the standard unaligned > startup/cleanup loops. And intuitively, given the existence of such > loops, alignment shouldn't matter much in practice, since the most > that shifting alignment can do is change the number of elements that > need to be handled by such loops by (SIMD alignment value / element > size). For doubles, in a buffer that has 16 byte alignment but not 32 > byte alignment, this means that worst case, we end up doing 4 > unnecessary non-SIMD operations. Its relevant when you have multiple buffer inputs. If they do not have the same alignment they can't be all peeled to a correct alignment, some of the inputs will always have be loaded unaligned. It might be that in modern x86 hardware unaligned loads might be cheaper. In Nehalem architectures using unaligned instructions have almost no penalty if the underlying memory is in fact aligned correctly, but there is still a penalty if it is not aligned. I'm not sure how relevant that is in the even newer architectures, the intel docs still recommend aligning memory though. 
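A quick way to see what the stock allocator hands back today is to look at a fresh array's
buffer address modulo 16 and 32. This is only an illustration of the alignment being
discussed (the exact result depends on the platform malloc), not part of any proposed change:

import numpy as np

a = np.zeros(1000)            # freshly allocated data buffer
addr = a.ctypes.data          # raw address of that buffer as an integer
print(addr % 16, addr % 32)   # 16-byte alignment is typical, 32-byte is not guaranteed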
[0] http://www.python.org/dev/peps/pep-0445/

From hedieh.ebrahimi at amphos21.com Wed Jan 15 05:12:42 2014
From: hedieh.ebrahimi at amphos21.com (Hedieh Ebrahimi)
Date: Wed, 15 Jan 2014 11:12:42 +0100
Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
Message-ID:

Hello,

I am trying to use the following line of code:

fileContent=loadtxt(filePath,dtype=str)

in order to load a text file located at path=filePath into a numpy array called fileContent.

I've simplified my file for the purpose of this question, but the file looks something like this:

file content:

C:\Users\Documents\Project\mytextfile1.txt
C:\Users\Documents\Project\mytextfile2.txt
C:\Users\Documents\Project\mytextfile3.txt

I try to print my fileContent array after I read it and it looks like this:

["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
 "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
 "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]

Why is this happening and how can I prevent it?

Also, if I have a line that starts like this in my file, python will crash on me. How can I fix this?

!--Timestep  ( line in file starting with !-- )

I guess it has to have something to do with the datatype. If I do not define the datatype it will be float by default, which will give me an error, and if I define the datatype as string as I did above, then I get the problems that I mentioned above.

I'd appreciate any help on how to fix this.

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From davidmenhur at gmail.com Wed Jan 15 05:25:26 2014
From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=)
Date: Wed, 15 Jan 2014 11:25:26 +0100
Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
In-Reply-To:
References:
Message-ID:

On 15 January 2014 11:12, Hedieh Ebrahimi wrote:
> I try to print my fileContent array after I read it and it looks like this:
>
> ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>
> Why is this happening and how can I prevent it?
> Also, if I have a line that starts like this in my file, python will crash
> on me. How can I fix this?

What is wrong with this case? If you are concerned about the multiple
backslashes, they are there because they are special symbols, and so they
have to be escaped (you actually want a backslash, not whatever else they
could mean).

Depending on what else is in the file, you may be better off reading the
file in pure python. Assuming there is nothing else, something like this
would work:

[line.strip() for line in open(filePath, 'r').readlines()]

/David.
-------------- next part --------------
An HTML attachment was scrubbed...
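Spelling that suggestion out a little more (the file name and the UTF-8 choice below are
only examples, and the '!--' filter matches the comment lines mentioned in the question),
one way to end up with ordinary strings in an array is:

import io
import numpy as np

# io.open works the same on python 2 and 3 and takes an explicit encoding
with io.open('pathlist.txt', encoding='utf-8') as f:
    paths = [line.strip() for line in f
             if line.strip() and not line.startswith('!--')]

fileContent = np.array(paths)   # unicode ('U') array, no b'...' reprs in it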
URL: From jtaylor.debian at googlemail.com Wed Jan 15 07:38:57 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 15 Jan 2014 13:38:57 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: Message-ID: <52D68161.7090807@googlemail.com> On 01/15/2014 11:25 AM, Da?id wrote: > On 15 January 2014 11:12, Hedieh Ebrahimi > wrote: > > I try to print my fileContent array after I read it and it looks > like this : > > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'" > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'" > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"] > > Why is this happening and how can I prevent it ? > Also if I have a line that starts like this in my file, python will > crash on me. how can i fix this ? > > > What is wrong with this case? If you are concerned about the multiple > backslashes, they are there because they are special symbols, and so > they have to be escaped (you actually want a backslash, not whatever > else they could mean). > you have the bytes representation and a duplicate slash in it. Its due to unicode strings in python3. A workaround that only works for ascii is: np.loadtxt(file, dtype=bytes).astype(str) for non ascii I guess you should use python directly as numpy would also require a python loop with explicit decoding. Currently handling strings in python3 with numpy is even worse than before, you always have to go over bytes and do explicit decodes to get python strings out of ascii data. What we might need in numpy is new string xtypes specifying encodings to allow sane conversion to python3 strings without the excessive memory usage of 4 byte unicode (ucs-4). e.g. if its ascii reuse a (which currently maps to bytes) np.loadtxt(file, dtype='a') for utf 8 data: d = np.loadtxt(file, dtype='utf8') so that type(d[0]) is unicode and not bytes as is currently the case if you don't want to store your arrays in 4 bytes per character. From jtaylor.debian at googlemail.com Wed Jan 15 07:43:50 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 15 Jan 2014 13:43:50 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <52D68161.7090807@googlemail.com> References: <52D68161.7090807@googlemail.com> Message-ID: <52D68286.5060908@googlemail.com> On 01/15/2014 01:38 PM, Julian Taylor wrote: > On 01/15/2014 11:25 AM, Da?id wrote: >> On 15 January 2014 11:12, Hedieh Ebrahimi for utf 8 data: > > d = np.loadtxt(file, dtype='utf8') > ups this is a very bad example as we can't have utf8 as its variable length, but we can have ascii and ucs-2 for lower footprint encodings with proper python string integration. From chris.barker at noaa.gov Wed Jan 15 12:27:28 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 15 Jan 2014 09:27:28 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <52D68161.7090807@googlemail.com> References: <52D68161.7090807@googlemail.com> Message-ID: On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > I try to print my fileContent array after I read it and it looks > > like this : > > > > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'" > > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'" > > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"] > > you have the bytes representation and a duplicate slash in it. 
> the duplicate slash confuses me, but I'm not running py3 to test, so... > np.loadtxt(file, dtype=bytes).astype(str) > > for non ascii I guess you should use python directly as numpy would also > require a python loop with explicit decoding. > > Currently handling strings in python3 with numpy is even worse than > before, you always have to go over bytes and do explicit decodes to get > python strings out of ascii data. > There is a MASSIVE set of threads on Python-dev about better support for ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have two issue shere that could be adressed: 1) loadtext behavior -- it's a really, really common case for data files suitable for loadtxt to be ascii, but they also could be another encoding -- so loadtext should have the option to specify the encoding (default to ascii? or ascii-compatible?) The trick here is handling both these cases correctly -- clearly loadtxt is broken on py3 now. This example works fine under py2. It seems to be reading the file as bytes, then passing those bytes off to a unicode string (str in py3), without specifying an encoding (which I think is how that b' ...' junk gets in there. note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as well: In [7]: np.loadtxt('pathlist.txt', dtype=unicode) Out[7]: array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt', u'C:\\Users\\Documents\\Project\\mytextfile2.txt', u'C:\\Users\\Documents\\Project\\mytextfile3.txt'], dtype=' From charlesr.harris at gmail.com Wed Jan 15 12:57:51 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 15 Jan 2014 10:57:51 -0700 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> Message-ID: On Wed, Jan 15, 2014 at 10:27 AM, Chris Barker wrote: > On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > >> > I try to print my fileContent array after I read it and it looks >> > like this : >> > >> > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'" >> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'" >> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"] >> > > >> you have the bytes representation and a duplicate slash in it. >> > > the duplicate slash confuses me, but I'm not running py3 to test, so... > > >> np.loadtxt(file, dtype=bytes).astype(str) >> >> for non ascii I guess you should use python directly as numpy would also >> require a python loop with explicit decoding. >> >> Currently handling strings in python3 with numpy is even worse than >> before, you always have to go over bytes and do explicit decodes to get >> python strings out of ascii data. >> > > There is a MASSIVE set of threads on Python-dev about better support for > ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have > two issue shere that could be adressed: > > 1) loadtext behavior -- it's a really, really common case for data files > suitable for loadtxt to be ascii, but they also could be another encoding > -- so loadtext should have the option to specify the encoding (default to > ascii? or ascii-compatible?) > > The trick here is handling both these cases correctly -- clearly loadtxt > is broken on py3 now. This example works fine under py2. > > It seems to be reading the file as bytes, then passing those bytes off to > a unicode string (str in py3), without specifying an encoding (which I > think is how that b' ...' > junk gets in there. 
> > note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as > well: > > In [7]: np.loadtxt('pathlist.txt', dtype=unicode) > Out[7]: > array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt', > u'C:\\Users\\Documents\\Project\\mytextfile2.txt', > u'C:\\Users\\Documents\\Project\\mytextfile3.txt'], > dtype=' > which is what should happen in py3. So the internal loadtxt code must be > confusing bytes and unicode objects... > > Anyway, this should work, and there should be an obvious way to spell it. > > 2) numpy string types -- it seems numpy already has a both a string type > and unicode type -- perhaps some re-naming or better documentation is in > order: > the string type 'S10', for example, should be clearly defined as 1-byte > per character ascii-compatible. > > I'm not sure how many bytes the unicode type has, but it may make sense to > be abel to choose UCS-2 or UCS-4 -- though memory is cheep, I'd probably go > with UCS-4 and be done with it. > There was a discussion of this long ago and UCS-4 was chosen as the numpy standard. There are just too many complications that arise in supporting both. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Wed Jan 15 13:25:31 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 15 Jan 2014 19:25:31 +0100 Subject: [Numpy-discussion] adding more unicode dtypes In-Reply-To: References: <52D68161.7090807@googlemail.com> Message-ID: <52D6D29B.8020509@googlemail.com> On 15.01.2014 18:57, Charles R Harris wrote: > ... > > There was a discussion of this long ago and UCS-4 was chosen as the > numpy standard. There are just too many complications that arise in > supporting both. > my guess is that that discussion was before python3 and you could still simply treat bytes == string? In python3 you need extra code to deal with arrays containing strings as the S type is interpreted as bytes which is not a string type anymore [0]. Someone on irc (I think Freddie Witherden CC'd) had a use case with huge ascii tables in numpy which now have to be stored as 4 bytes unicode on disk or decode bytes all the time. I personally don't use strings in arrays so I can neither judge the impact nor the use, but it seems to me like at least having an ascii dtype for python2<->python3 compatibility would be useful. [0] https://github.com/numpy/numpy/issues/4162 From chris.barker at noaa.gov Wed Jan 15 14:40:58 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 15 Jan 2014 11:40:58 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> Message-ID: On Wed, Jan 15, 2014 at 9:57 AM, Charles R Harris wrote: > There was a discussion of this long ago and UCS-4 was chosen as the numpy > standard. There are just too many complications that arise in supporting > both. > fair enough -- but loadtxt appears to be broken just the same. Any proposals for that? My proposal: loadtxt accepts an encoding argument. default is ascii -- that's what it's doing now, anyway, yes? If the file is encoded ascii, then a one-byte-per character dtype is used for text data, unless the user specifies otherwise (do they need to specify anyway?) If the file has another encoding, the the default dtype for text is unicode. Not sure about other one-byte per character encodings (e.g. latin-1) The defaults may be moot, if the loadtxt doesn't have auto-detection of text in a filie anyway. 
This all required that there be an obvious way for the user to spell the one-byte-per character dtype -- I think 'S' will do it. Note to OP: what happens if you specify 'S' for your dtype, rather than str - it works for me on py2: In [16]: np.loadtxt('pathlist.txt', dtype='S') Out[16]: array(['C:\\Users\\Documents\\Project\\mytextfile1.txt', 'C:\\Users\\Documents\\Project\\mytextfile2.txt', 'C:\\Users\\Documents\\Project\\mytextfile3.txt'], dtype='|S42') Note: this leaves us with what to pass back to the user when they index into an array of type 'S*' -- a bytes object or a unicode object (decoded as ascii). I think a unicode object, in keeping with proper py3 behavior. This would be like we currently do with, say floating point numbers: We can store/operate with 32 bit floats, but when you pass it back as a python type, you get the native python float -- 64bit. NOTE: another option is to use latin-1 all around, rather than ascii -- you may get garbage from the higher value bytes, but it won't barf on you. -Chris > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Jan 15 15:07:35 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 15 Jan 2014 12:07:35 -0800 Subject: [Numpy-discussion] adding more unicode dtypes In-Reply-To: <52D6D29B.8020509@googlemail.com> References: <52D68161.7090807@googlemail.com> <52D6D29B.8020509@googlemail.com> Message-ID: Julian -- beat me to it! On Wed, Jan 15, 2014 at 10:25 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > On 15.01.2014 18:57, Charles R Harris wrote: > > There was a discussion of this long ago and UCS-4 was chosen as the > > numpy standard. There are just too many complications that arise in > > supporting both. > supporting both UCS-4 and UCS-2 would be more pain than it's worth. > In python3 you need extra code to deal with arrays containing strings as > the S type is interpreted as bytes which is not a string type anymore [0]. > ouch! I was just assuming that it still was -- yes, I really think we need a one-byte-per char string type -- probably ascii, but we could do latin-1 and let the buyer beware of the higher value bytes Someone on irc (I think Freddie Witherden CC'd) had a use case with huge > ascii tables in numpy which now have to be stored as 4 bytes unicode on > disk or decode bytes all the time. > and ascii data is not the least bit rare in the science world in particular. > I personally don't use strings in arrays so I can neither judge the > impact nor the use, but it seems to me like at least having an ascii > dtype for python2<->python3 compatibility would be useful. > I think py2<->py3 compatibilty is a separate issue -- we should have this if it's a good thing to have, not because of that. And it is a good thing to have. And since this is a new thread -- regardless of the decision on this, loadtxt is broken -- we certainly should be able to parse ascii text and return something reasonable -- unicode strings would have been fine in the OPs case, if they didn't have the extra bytes to tring crap in them. 
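For ascii-clean data, one possible spelling of that round trip with the existing chararray
helpers (np.char.decode / np.char.encode) is sketched below; the file name and the latin-1
choice are just placeholders, and this is a workaround rather than the new dtype being
discussed:

import numpy as np

raw = np.loadtxt('pathlist.txt', dtype='S')   # compact one-byte-per-char storage
text = np.char.decode(raw, 'latin-1')         # element-wise decode -> 'U' array of real strings
back = np.char.encode(text, 'latin-1')        # and back to 'S' bytes for writing out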
[0] https://github.com/numpy/numpy/issues/4162 from that: The transition towards split string/bytes types in Python 3 has the unfortunate side effect of breaking the following snippet: np.array("Hello", dtype="|S").item() == "Hello" Sorry for not testing in py3, but this makes it look like the "S" dtype is one-byte per char strings, but creates a bytes object, rather than a unicode (py3 str) object. As in my other note, I think it would be better to have it return a unicode string by default. But it looks like you can still use it to store large quantities of ascii data if you want. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Wed Jan 15 15:15:15 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Wed, 15 Jan 2014 12:15:15 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array (Charles R Harris) Message-ID: On Wed, Jan 15, 2014 at 9:52 AM, wrote: > Date: Wed, 15 Jan 2014 10:57:51 -0700 > From: Charles R Harris > Subject: Re: [Numpy-discussion] using loadtxt to load a text file in > to a numpy array > To: Discussion of Numerical Python > Message-ID: > < > CAB6mnxJpvJbsoZzY0Ctk1bk+kDCUDivC9KrzYt1johU33bZOLw at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > On Wed, Jan 15, 2014 at 10:27 AM, Chris Barker >wrote: > > There was a discussion of this long ago and UCS-4 was chosen as the numpy > standard. There are just too many complications that arise in supporting > both. > > Chuck > In that case, perhaps another function altogether is called for. DG -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Jan 15 18:42:35 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 15 Jan 2014 15:42:35 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: Message-ID: bump back to the OP: On Wed, Jan 15, 2014 at 2:12 AM, Hedieh Ebrahimi < hedieh.ebrahimi at amphos21.com> wrote: > fileContent=loadtxt(filePath,dtype=str) > do either of these work for you? fileContent=loadtxt(filePath,dtype='S') or fileContent=loadtxt(filePath,dtype=np.unicode) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Wed Jan 15 18:58:25 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 16 Jan 2014 00:58:25 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: Message-ID: <52D720A1.9060100@googlemail.com> On 16.01.2014 00:42, Chris Barker wrote: > bump back to the OP: > On Wed, Jan 15, 2014 at 2:12 AM, Hedieh Ebrahimi > > wrote: > > fileContent=loadtxt(filePath,dtype=str) > > > do either of these work for you? > > fileContent=loadtxt(filePath,dtype='S') this gives you bytes not a string, this can only be fixed by adding new dtypes, see the other thread about that. 
> > or > > fileContent=loadtxt(filePath,dtype=np.unicode) > same as using python str you get the output originally posted, bytes representation with duplicated slashes. This is a bug in loadtxt we need to fix independent of adding new dtypes. It is also independent of the encoding of the text file, loadtxt doesn't seem to be able to open other encodings than ascii/utf8 at all and has no option to tell it what the file is. as mentioned in my earlier mail this works for ascii: np.loadtxt('test.txt',dtype=bytes).astype(str) or of course looping and decoding explicitly. From oscar.j.benjamin at gmail.com Wed Jan 15 19:06:22 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Thu, 16 Jan 2014 00:06:22 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <52D68161.7090807@googlemail.com> References: <52D68161.7090807@googlemail.com> Message-ID: On 15 January 2014 12:38, Julian Taylor wrote: > On 01/15/2014 11:25 AM, Da?id wrote: >> On 15 January 2014 11:12, Hedieh Ebrahimi > > wrote: >> >> I try to print my fileContent array after I read it and it looks >> like this : >> >> ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'" >> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'" >> "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"] >> >> Why is this happening and how can I prevent it ? >> Also if I have a line that starts like this in my file, python will >> crash on me. how can i fix this ? >> >> >> What is wrong with this case? If you are concerned about the multiple >> backslashes, they are there because they are special symbols, and so >> they have to be escaped (you actually want a backslash, not whatever >> else they could mean). >> > > you have the bytes representation and a duplicate slash in it. > Its due to unicode strings in python3. So why does the array store the repr of a bytes string? Surely that's just a loadtxt bug and no one is actually depending on that behaviour. Oscar From chris.barker at noaa.gov Wed Jan 15 20:10:07 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 15 Jan 2014 17:10:07 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <52D720A1.9060100@googlemail.com> References: <52D720A1.9060100@googlemail.com> Message-ID: On Wed, Jan 15, 2014 at 3:58 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > fileContent=loadtxt(filePath,dtype='S') > > this gives you bytes not a string, this can only be fixed by adding new > dtypes, or changing the behavior or dtype 'S', but yes, the other thread. But the OP's problem was not that s/he got bytes, but that the content was wrong -- he got the repr of bytes in a py3 string. - > same as using python str you get the output originally posted, bytes > representation with duplicated slashes. > This is a bug in loadtxt we need to fix independent of adding new dtypes. > yup. > It is also independent of the encoding of the text file, loadtxt doesn't > seem to be able to open other encodings than ascii/utf8 at all and has > no option to tell it what the file is. > a key missing feature -- and I doubt it does utf-8 right, either. as mentioned in my earlier mail this works for ascii: > > np.loadtxt('test.txt',dtype=bytes).astype(str) > thanks -- I wasn't sure what astype would do for that. and what are you getting then, unicode or ascii? Thanks, -Chris -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Thu Jan 16 05:43:05 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Thu, 16 Jan 2014 10:43:05 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> Message-ID: <20140116104303.GA11119@gmail.com> On Wed, Jan 15, 2014 at 11:40:58AM -0800, Chris Barker wrote: > On Wed, Jan 15, 2014 at 9:57 AM, Charles R Harris > wrote: > > > > There was a discussion of this long ago and UCS-4 was chosen as the numpy > > standard. There are just too many complications that arise in supporting > > both. > > > > fair enough -- but loadtxt appears to be broken just the same. Any > proposals for that? > > My proposal: > > loadtxt accepts an encoding argument. > > default is ascii -- that's what it's doing now, anyway, yes? No it's loading the file reading a line, encoding the line with latin-1, and then putting the repr of the resulting byte-string as a unicode string into a UCS-4 array (dtype=' > If the file is encoded ascii, then a one-byte-per character dtype is used > for text data, unless the user specifies otherwise (do they need to specify > anyway?) > > If the file has another encoding, the the default dtype for text is unicode. That's a silly idea. There's already the dtype='S' for ascii that will give one byte per character. However numpy.loadtxt(dtype='S') doesn't actually use ascii IIUC. It loads the file as text with the default system encoding, encodes the text with latin-1 and stores the resulting bytes into a dtype='S' array. I think it should just open the file in binary read the bytes and store them in the dtype='S' array. The current behaviour strikes me as a hangover from the Python 2.x 8-bit text model. > Not sure about other one-byte per character encodings (e.g. latin-1) > > The defaults may be moot, if the loadtxt doesn't have auto-detection of > text in a filie anyway. > > This all required that there be an obvious way for the user to spell the > one-byte-per character dtype -- I think 'S' will do it. They should use 'S' and not encoding='ascii'. If the user provides an encoding then it should be used to open the file and decode it to unicode resulting in a dtype='U' array. (Python 3 handles this all for you). > Note to OP: what happens if you specify 'S' for your dtype, rather than str > - it works for me on py2: > > In [16]: np.loadtxt('pathlist.txt', dtype='S') > Out[16]: > array(['C:\\Users\\Documents\\Project\\mytextfile1.txt', > 'C:\\Users\\Documents\\Project\\mytextfile2.txt', > 'C:\\Users\\Documents\\Project\\mytextfile3.txt'], > dtype='|S42') It only seems to work because you're using ascii data. On Py3 you'll have byte strings corresponding to the text in the file encoded as latin-1 (regardless of the encoding used in the file). loadtxt doesn't open the file in binary or specify an encoding so the file will be opened with the system default encoding as determined by the standard builtins.open. The resulting text is decoded according to that encoding and then reencoded as latin-1 which will corrupt the binary form of the data if the system encoding is not compatible with latin-1 (e.g. ascii and latin-1 will work but utf-8 will not). 
> > Note: this leaves us with what to pass back to the user when they index > into an array of type 'S*' -- a bytes object or a unicode object (decoded > as ascii). I think a unicode object, in keeping with proper py3 behavior. > This would be like we currently do with, say floating point numbers: > > We can store/operate with 32 bit floats, but when you pass it back as a > python type, you get the native python float -- 64bit. > > NOTE: another option is to use latin-1 all around, rather than ascii -- you > may get garbage from the higher value bytes, but it won't barf on you. I guess you're alluding to the idea that reading/writing files as latin-1 will pretend to seamlessly decode/encode any bytes preserving binary data in any round-trip. This concept is already broken if you intend to do any processing, indexing or slicing of the array. Additionally the current loadtxt behaviour fails to achieve this round-trip even for the 'S' dtype even if you don't do any processing: $ ipython3 Python 3.2.3 (default, Sep 25 2013, 18:22:43) Type "copyright", "credits" or "license" for more information. IPython 0.12.1 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. In [1]: with open('tmp.py', 'w') as fout: # Implicitly utf-8 here fout.write('??\n' * 3) ...: In [2]: import numpy In [3]: a = numpy.loadtxt('tmp.py') ValueError: could not convert string to float: b'\xc5\xe5' In [4]: a = numpy.loadtxt('tmp.py', dtype='S') In [5]: a Out[5]: array([b'\xc5\xe5', b'\xc5\xe5', b'\xc5\xe5'], dtype='|S2') In [6]: a.tostring() Out[6]: b'\xc5\xe5\xc5\xe5\xc5\xe5' In [7]: with open('tmp.py', 'rb') as fin: ...: text = fin.read() ...: In [8]: text Out[8]: b'\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n' This is a mess. I don't know about how to handle backwards compatibility but the sensible way to handle this in *both* Python 2 and 3 is that dtype='S' opens the file in binary, reads byte strings, and stores them in an array with dtype='S'. dtype='U' should open the file as text with an encoding argument (or system default if not supplied), decode the bytes and create an array with dtype='U'. The only reasonable difference between Python 2 and 3 is which of these two behaviours dtype=str should do. Oscar From chris.barker at noaa.gov Thu Jan 16 12:08:38 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 16 Jan 2014 09:08:38 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140116104303.GA11119@gmail.com> References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: On Thu, Jan 16, 2014 at 2:43 AM, Oscar Benjamin wrote: > > My proposal: > > > > loadtxt accepts an encoding argument. > > > > default is ascii -- that's what it's doing now, anyway, yes? > > No it's loading the file reading a line, encoding the line with latin-1, > and > then putting the repr of the resulting byte-string as a unicode string > into a > UCS-4 array (dtype=' If the file is encoded ascii, then a one-byte-per character dtype is used > > for text data, unless the user specifies otherwise (do they need to > specify > > anyway?) > > > > If the file has another encoding, the the default dtype for text is > unicode. > > That's a silly idea. There's already the dtype='S' for ascii that will give > one byte per character. 
> Except that 'S' is being translated to a bytes object, and in py3 bytes is not really text -- see the other thread. However numpy.loadtxt(dtype='S') doesn't actually use ascii IIUC. It loads > the file as text with the default system encoding, not such a bad idea in principle, but I think with scientific data files in particular, the file was just as likely generated on a different system, so system settings should be avoided. My guess is that a large fraction of systems have system encodings that are ascii-compatible, so we'll get away with this most of the time, but explicit is better than implicit, and all that. encodes the text with > latin-1 and stores the resulting bytes into a dtype='S' array. I think it > should just open the file in binary read the bytes and store them in the > dtype='S' array. The current behaviour strikes me as a hangover from the > Python 2.x 8-bit text model. > not sure it's even that -- I suspect it's a broken attempt to match the py3 text model... > Not sure about other one-byte per character encodings (e.g. latin-1) The > defaults may be moot, if the loadtxt doesn't have auto-detection of text in > a filie anyway. > I'm not suggesting auto0detection, but I am suggesting the ability to specify an encoding, and in that case, we need a default, and I don't think it should be the system encoding. > This all required that there be an obvious way for the user to spell the > > one-byte-per character dtype -- I think 'S' will do it. > > They should use 'S' and not encoding='ascii'. that is stating implicitly that 'S' is ascii-compatible, but it gets traslated to the py3 bytes type, which the pyton dev folks REALLY want to mean "arbitrary bytes", rather than 'ascii text'. practically, it means you need to decode it to use it as text -- compare with a string, etc... If the user provides an encoding > then it should be used to open the file and decode it to unicode resulting > in > a dtype='U' array. (Python 3 handles this all for you). I think it may be an important use case to pull ansi-compatible text out of a file and put it into a 1-byte per character dtype (i.,e 'S'). Folks don't necessarily want or need 4 bytes per charater. In practice this probably only makes sense it the file is in an ascii-compatible encoding anyway, but I like the idea of keeping the file encoding and the dtype independent. It only seems to work because you're using ascii data. > (or latin-1?) well, yes, but that was the OP's example. though it was file names, so he'd probably ultimately want them as py3 strings... > which will > corrupt the binary form of the data if the system encoding is not > compatible > with latin-1 (e.g. ascii and latin-1 will work but utf-8 will not). a good reason not to use the system default encoding! > NOTE: another option is to use latin-1 all around, rather than ascii -- > you > > may get garbage from the higher value bytes, but it won't barf on you. > > I guess you're alluding to the idea that reading/writing files as latin-1 > will > pretend to seamlessly decode/encode any bytes preserving binary data in any > round-trip. yes, exactly -- a practical common use case is that there is non-ascii compliant bytes in a data stream, but that the use-case doesn't care what they are. If you use ascii, then you get exceptions you don't need to get. > This concept is already broken if you intend to do any processing, > indexing or slicing of the array. 
no it's not -- latin-1 is ascii-compatible (as is utf-8), so a lot of processing will work fine -- splitting on whitespace or whatever, etc. yes, indexing can go to heck if you have utf-8 or, of course, non-ascii compatible encoding -- but that's never going to work without specifying an encoding anyway. > Additionally the current loadtxt behaviour > fails to achieve this round-trip even for the 'S' dtype even if you don't > do > any processing: > right -- I think we agree that it's broken now. This is a mess. I don't know about how to handle backwards compatibility but > the sensible way to handle this in *both* Python 2 and 3 is that dtype='S' > opens the file in binary, reads byte strings, and stores them in an array > with > dtype='S'. dtype='U' should open the file as text with an encoding argument > (or system default if not supplied), decode the bytes and create an array > with > dtype='U'. agreed -- except for the system encoding part.... > The only reasonable difference between Python 2 and 3 is which of > these two behaviours dtype=str should do. well, str is a py3 string in py3 -- so it should be dtype 'U'. Personally, I avoid using the native types for dtype arguemtns anyway, so users should use: dtype=np.unicode or dtype=np.string0 (or np.string_) -- or???? How do you spell the dtype that 'S' give you???? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at climate.com Thu Jan 16 19:28:53 2014 From: shoyer at climate.com (Stephan Hoyer) Date: Thu, 16 Jan 2014 16:28:53 -0800 Subject: [Numpy-discussion] Allowing slices as arguments for ndarray.take Message-ID: There was a discussion last year about slicing along specified axes in numpy arrays: http://mail.scipy.org/pipermail/numpy-discussion/2012-April/061632.html I'm finding that slicing along specified axes is a common task for me when writing code to manipulate N-D arrays. The method ndarray.take basically does what I would like, except it cannot take slice objects as argument. In the mean-time, I've written a little helper function: def take(a, indices, axis): index = [slice(None)] * a.ndim index[axis] = indices return a[tuple(index)] Is there support for allowing the `indices` argument to `take` to take Python slice objects as well as arrays? That would alleviate the need for my helper function. Cheers, Stephan -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Fri Jan 17 04:12:00 2014 From: sebastian at sipsolutions.net (sebastian) Date: Fri, 17 Jan 2014 09:12:00 +0000 Subject: [Numpy-discussion] Allowing slices as arguments for ndarray.take In-Reply-To: References: Message-ID: <8639a503927ee75ee1e116b2e7e7b814@sipsolutions.net> On 2014-01-17 00:28, Stephan Hoyer wrote: > There was a discussion last year about slicing along specified axes in > numpy arrays: > http://mail.scipy.org/pipermail/numpy-discussion/2012-April/061632.html > [1] > > I'm finding that slicing along specified axes is a common task for me > when writing code to manipulate N-D arrays. > > The method ndarray.take basically does what I would like, except it > cannot take slice objects as argument. In the mean-time, I've written > a little helper function: > > def take(a, indices, axis): > ? ? index = [slice(None)] * a.ndim > ? ? 
index[axis] = indices > ? ? return a[tuple(index)] > > Is there support for allowing the `indices` argument to `take` to take > Python slice objects as well as arrays? That would alleviate the need > for my helper function. > > Cheers, > Stephan > > > > Links: > ------ > [1] > http://mail.scipy.org/pipermail/numpy-discussion/2012-April/061632.html > > Hey, Personally, I am not sure that generalizing take is the right approach. Take is currently orthogonal to indexing implementation wise and has some smaller differences. Given a good idea for the api, I think a new function maybe better. Since I am not on a computer at the moment I did not check the old discussions though. - Sebastian _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jtaylor.debian at googlemail.com Fri Jan 17 04:38:15 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 17 Jan 2014 10:38:15 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: This thread is getting a little out of hand which is my fault for initially mixing different topics in one mail, so let me try to summarize: We have three issues here: - a loadtxt bug when loading strings in python3 this has nothing to do with encodings or dtypes it is a bug that should be fixed. Not more not less. the fix is probably removing a repr() somewhere and converting the data to unicode as the user requested as str == unicode in py3, this is the normal change you must account for when migrating to p3. - no possibility to specify the encoding of a file in loadtxt this is a missing feature, currently it uses the system default which is good and should stay that way. It is only missing an option to tell it to treat it differently. There should be little debate about changing the default, especially not using latin1. The system default exists for a good reason. Note on linux it is UTF-8 which is a good choice. I'm not familiar with windows but all programs should at least have the option to use UTF-8 as output too. This has nothing to do with indexing or any kind of processing of the numpy arrays. The fix should be trivial to do, just add an encoding keyword argument and pass it on to python. The workaround should be passing a file object to loadtxt instead of a file name. Python file objects already have the encoding argument. - inconvenience in dealing with strings in python 3. bytes are not strings in python3 which means ascii data is either a byte array which can be inconvenient to deal with or 4 byte unicode which wastes space. A proposal to fix this would be to add a one or two byte dtype with a specific encoding that behaves similar to bytes but converts to string when outputting to python for comparisons etc. For backward compatibility we *cannot* change S. Maybe we could change the meaning of 'a' but it would be safer to add a new dtype, possibly 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype. The main issue is probably: is it worth it and who does the work? -------------- next part -------------- An HTML attachment was scrubbed... 
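A minimal sketch of the file-object workaround from the second point above; the file name
and the latin-1 encoding are stand-ins, and numeric columns are assumed here since the
string-dtype path still has the repr bug from the first point:

import io
import numpy as np

# the decode happens in the file object, with an explicit encoding
with io.open('table.txt', encoding='latin-1') as f:
    values = np.loadtxt(f)   # loadtxt only iterates over the already-decoded lines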
URL: From pav at iki.fi Fri Jan 17 05:59:27 2014 From: pav at iki.fi (Pauli Virtanen) Date: Fri, 17 Jan 2014 10:59:27 +0000 (UTC) Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: Julian Taylor googlemail.com> writes: [clip] > - inconvenience in dealing with strings in python 3. > > bytes are not strings in python3 which means ascii data is either a byte > array which can be inconvenient to deal with or 4 byte unicode which > wastes space. > > A proposal to fix this would be to add a one or two byte dtype with a specific > encoding that behaves similar to bytes but converts to string when outputting > to python for comparisons etc. > > For backward compatibility we *cannot* change S. Maybe we could change > the meaning of 'a' but it would be safer to add a new dtype, possibly > 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype. > > The main issue is probably: is it worth it and who does the work? I don't think this is a good idea: the bytes vs. unicode separation in Python 3 exists for a good reason. If unicode is not needed, why not just use the bytes data type throughout the program? (Also, assuming that ASCII is in general good for text-format data is quite US-centric.) Christopher Barker wrote: > > How do you spell the dtype that 'S' give you???? > 'S' is bytes. dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent. -- Pauli Virtanen From pav at iki.fi Fri Jan 17 07:17:28 2014 From: pav at iki.fi (Pauli Virtanen) Date: Fri, 17 Jan 2014 12:17:28 +0000 (UTC) Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: Julian Taylor googlemail.com> writes: [clip] > For backward compatibility we *cannot* change S. > Maybe we could change the meaning of 'a' but it would be safer > to add a new dtype, possibly 'S' can be deprecated in favor > of 'B' when we have a specific encoding dtype. Note that the rename 'S' -> 'B' was not done in the Python 3 port, because 'B' already denotes uint8, >>> np.array([1], dtype='B') array([1], dtype=uint8) -- Pauli Virtanen From josef.pktd at gmail.com Fri Jan 17 07:35:42 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 07:35:42 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 5:59 AM, Pauli Virtanen wrote: > Julian Taylor googlemail.com> writes: > [clip] >> - inconvenience in dealing with strings in python 3. >> >> bytes are not strings in python3 which means ascii data is either a byte >> array which can be inconvenient to deal with or 4 byte unicode which >> wastes space. >> >> A proposal to fix this would be to add a one or two byte dtype with a specific >> encoding that behaves similar to bytes but converts to string when outputting >> to python for comparisons etc. >> >> For backward compatibility we *cannot* change S. Maybe we could change >> the meaning of 'a' but it would be safer to add a new dtype, possibly >> 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype. >> >> The main issue is probably: is it worth it and who does the work? > > I don't think this is a good idea: the bytes vs. unicode separation in > Python 3 exists for a good reason. 
If unicode is not needed, why not just > use the bytes data type throughout the program? > > (Also, assuming that ASCII is in general good for text-format data is > quite US-centric.) > > Christopher Barker wrote: >> >> How do you spell the dtype that 'S' give you???? >> > > 'S' is bytes. > > dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent. 'S' is bytes, is a feature not a bug, I thought. I didn't pay much attention to the two threads because I don't use loadtxt. But I think the same issue is in genfromtxt, recfromtxt, ... I don't have a lot of experience with python 3, but in the initial python 3 compatibility conversion of statsmodels, I followed numpy's lead and used the numpy helper functions and converted all strings to bytes. Everything loaded by genfromtxt or similar reades bytes, files are opened with "rb". In most places our code doesn't really care, as long as numpy.unique, and similar work either way. But in some cases there were some strange things working with bytes. There are also some weirder cases with non-ASCII "strings", and I also have problems in interactive work when the interpreter encoding interfers. Also maybe related, our Stata data file reader genfromdta handles cyrillic languages (Russian IIRC) in the same way as ascii, I don't know the details but Skipper fixed a bug so it works. I'm pretty sure interaction statsmodels/pandas/patsy has problems/bugs with non-ASCII support in variable names, but my impression is that string data as bytes causes few problems. Josef > > -- > Pauli Virtanen > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From oscar.j.benjamin at gmail.com Fri Jan 17 07:44:16 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Fri, 17 Jan 2014 12:44:16 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: <20140117124414.GA2253@gmail.com> On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: > Julian Taylor googlemail.com> writes: > [clip] > > - inconvenience in dealing with strings in python 3. > > > > bytes are not strings in python3 which means ascii data is either a byte > > array which can be inconvenient to deal with or 4 byte unicode which > > wastes space. It doesn't waste that much space in practice. People have been happily using Python 2's 4-byte-per-char unicode string on wide builds (e.g. on Linux) for years in all kinds of text heavy applications. $ python2 Python 2.7.3 (default, Sep 26 2013, 20:03:06) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.getsizeof(u'a' * 1000) 4052 > > For backward compatibility we *cannot* change S. Do you mean to say that loadtxt cannot be changed from decoding using system default, splitting on newlines and whitespace and then encoding the substrings as latin-1? An obvious improvement would be along the lines of what Chris Barker suggested: decode as latin-1, do the processing and then reencode as latin-1. Or just open the file in binary and use the bytes string methods. Either of these has the advantage that it won't corrupt the binary representation of the data - assuming ascii-compatible whitespace and newlines (e.g. utf-8 and most currently used 8-bit encodings). 
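Both of those round-trip-preserving options are easy to sketch for a hypothetical
rectangular, ascii-compatible table (the file name is made up); neither mixes the system
codec with a hard-coded re-encode:

import numpy as np

# (a) treat the file as latin-1 text end to end
with open('table.txt', encoding='latin-1') as f:
    rows = [line.split() for line in f]

# (b) stay in bytes the whole way through
with open('table.txt', 'rb') as f:
    brows = [line.split() for line in f]

arr = np.array(rows)     # '<U..' array
barr = np.array(brows)   # '|S..' array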
In the situations where the current behaviour differs from this the user *definitely* has mojibake. Can anyone possibly be relying on that (except in the sense of having implemented a workaround that would break if it was fixed)? > > Maybe we could change > > the meaning of 'a' but it would be safer to add a new dtype, possibly > > 'S' can be deprecated in favor of 'B' when we have a specific encoding dtype. > > > > The main issue is probably: is it worth it and who does the work? > > I don't think this is a good idea: the bytes vs. unicode separation in > Python 3 exists for a good reason. If unicode is not needed, why not just > use the bytes data type throughout the program? Or on the other hand, why try to use bytes when you're clearly dealing with text data? If you're concerned about memory usage why not use Python strings? As of CPython 3.3 strings consisting only of latin-1 characters are stored with 1 char-per-byte. This is only really sensible for immutable strings with an opaque memory representation though so numpy shouldn't try to copy it. > (Also, assuming that ASCII is in general good for text-format data is > quite US-centric.) Indeed. The original use case in this thread was a text file containing file paths. In most of the world there's a reasonable chance that file paths can contain non-ascii characters. The current behaviour of decoding using one codec and encoding with latin-1 would, in many cases, break if the user tried to e.g. open() a file using a byte-string from the array. Oscar From aldcroft at head.cfa.harvard.edu Fri Jan 17 08:09:00 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Fri, 17 Jan 2014 08:09:00 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 5:59 AM, Pauli Virtanen wrote: > Julian Taylor googlemail.com> writes: > [clip] > > - inconvenience in dealing with strings in python 3. > > > > bytes are not strings in python3 which means ascii data is either a byte > > array which can be inconvenient to deal with or 4 byte unicode which > > wastes space. > > > > A proposal to fix this would be to add a one or two byte dtype with a > specific > > encoding that behaves similar to bytes but converts to string when > outputting > > to python for comparisons etc. > > > > For backward compatibility we *cannot* change S. Maybe we could change > > the meaning of 'a' but it would be safer to add a new dtype, possibly > > 'S' can be deprecated in favor of 'B' when we have a specific encoding > dtype. > > > > The main issue is probably: is it worth it and who does the work? > > I don't think this is a good idea: the bytes vs. unicode separation in > Python 3 exists for a good reason. If unicode is not needed, why not just > use the bytes data type throughout the program? > I've been playing around with porting a stack of analysis libraries to Python 3 and this is a very timely thread and comment. What I discovered right away is that all the string data coming from binary HDF5 files show up (as expected) as 'S' type,, but that trying to make everything actually work in Python 3 without converting to 'U' is a big mess of whack-a-mole. Yes, it's possible to change my libraries to use bytestring literals everywhere, but the Python 3 user experience becomes horrible because to interact with the data all downstream applications need to use bytestring literals everywhere. E.g. 
doing a simple filter like `string_array == 'foo'` doesn't work, and this will break all existing code when trying to run in Python 3. And every time you try to print something it has this horrible "b" in front. Ugly, and it just won't work well in the end. Following the excellent advice at http://nedbatchelder.com/text/unipain.html, I've come to the conclusion that the only way to support Python 3 is to bite the bullet and do the "unicode sandwich". That is to say convert all external bytestring values to 'U' arrays for internal (and user) manipulation, and back to 'S' for delivery to files / network etc. This is a pain and very inefficient, but at least the the Python 3 user experience is natural and pleasant. I figure if you are manipulating anything less than ~Gb of text data then it won't be a disaster. The upshot from this is that I would be very much in favor of solutions that address the inefficiency issue of using 4 bytes / character in the common use-case of pure-ASCII strings. Right now this is the single biggest issue I see for migrating to Python 3. Otherwise making the code python 2 / 3 compatible wasn't too difficult. - Tom > > (Also, assuming that ASCII is in general good for text-format data is > quite US-centric.) > > Christopher Barker wrote: > > > > How do you spell the dtype that 'S' give you???? > > > > 'S' is bytes. > > dtype='S', dtype=bytes, and dtype=np.bytes_ are all equivalent. > > -- > Pauli Virtanen > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Fri Jan 17 08:10:19 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 17 Jan 2014 14:10:19 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140117124414.GA2253@gmail.com> References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin wrote: > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: > > Julian Taylor googlemail.com> writes: > > [clip] > > > > For backward compatibility we *cannot* change S. > > Do you mean to say that loadtxt cannot be changed from decoding using > system > default, splitting on newlines and whitespace and then encoding the > substrings > as latin-1? > unicode dtypes have nothing to do with the loadtxt issue. They are not related. > > An obvious improvement would be along the lines of what Chris Barker > suggested: decode as latin-1, do the processing and then reencode as > latin-1. > no, the right solution is to add an encoding argument. Its a 4 line patch for python2 and a 2 line patch for python3 and the issue is solved, I'll file a PR later. No latin1 de/encoding is required for anything, I don't know why you would want do to that in this context. Does opening latin1 files even work with current loadtxt? It currently uses UTF-8 which is to my knowledge not compatible with latin1. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From freddie at witherden.org Fri Jan 17 08:18:38 2014 From: freddie at witherden.org (Freddie Witherden) Date: Fri, 17 Jan 2014 13:18:38 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: <52D92DAE.5020409@witherden.org> On 17/01/14 13:09, Aldcroft, Thomas wrote: > I've been playing around with porting a stack of analysis libraries to > Python 3 and this is a very timely thread and comment. What I > discovered right away is that all the string data coming from binary > HDF5 files show up (as expected) as 'S' type,, but that trying to make > everything actually work in Python 3 without converting to 'U' is a big > mess of whack-a-mole. > > Yes, it's possible to change my libraries to use bytestring literals > everywhere, but the Python 3 user experience becomes horrible because to > interact with the data all downstream applications need to use > bytestring literals everywhere. E.g. doing a simple filter like > `string_array == 'foo'` doesn't work, and this will break all existing > code when trying to run in Python 3. And every time you try to print > something it has this horrible "b" in front. Ugly, and it just won't > work well in the end. In terms of HDF5 it is interesting to look at how h5py -- which has to go between NumPy types and HDF5 conventions -- handles the problem as described here: http://www.h5py.org/docs/topics/strings.html which IMHO got it about right. Regards, Freddie. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From jtaylor.debian at googlemail.com Fri Jan 17 08:31:32 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 17 Jan 2014 14:31:32 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 2:10 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin < > oscar.j.benjamin at gmail.com> wrote:... > ... > No latin1 de/encoding is required for anything, I don't know why you would > want do to that in this context. > Does opening latin1 files even work with current loadtxt? > It currently uses UTF-8 which is to my knowledge not compatible with > latin1. > just tried it, doesn't work so there is nothing we need to keep working: f = codecs.open('test.txt', 'wt', encoding='latin1') f.write(u'??\n') f.close() np.loadtxt('test.txt') ValueError: could not convert string to float: ?? or UnicodeDecodeError: if provided with unicode dtype there are a couple more unicode issues in the test loading (it converts to bytes even if unicode is requested), but they look simple to fix. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oscar.j.benjamin at gmail.com Fri Jan 17 08:40:34 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Fri, 17 Jan 2014 13:40:34 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> Message-ID: <20140117134033.GB2253@gmail.com> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin > wrote: > > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: > > > Julian Taylor googlemail.com> writes: > > > [clip] > > > > > > > For backward compatibility we *cannot* change S. > > > > Do you mean to say that loadtxt cannot be changed from decoding using > > system > > default, splitting on newlines and whitespace and then encoding the > > substrings > > as latin-1? > > > > unicode dtypes have nothing to do with the loadtxt issue. They are not > related. I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier, if the file is not encoded as ascii or latin-1 then the byte strings are corrupted (see below). This is because loadtxt opens the file with the default system encoding (by not explicitly specifying an encoding): https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732 It then processes each line with asbytes() which encodes them as latin-1: https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784 https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28 Being an English speaker I don't normally use non-ascii characters in filenames but my system (Ubuntu Linux) still uses utf-8 rather than latin-1 or (and rightly so!). > > > > An obvious improvement would be along the lines of what Chris Barker > > suggested: decode as latin-1, do the processing and then reencode as > > latin-1. > > > > no, the right solution is to add an encoding argument. > Its a 4 line patch for python2 and a 2 line patch for python3 and the issue > is solved, I'll file a PR later. What is the encoding argument for? Is it to be used to decode, process the text and then re-encode it for an array with dtype='S'? Note that there are two encodings: one for reading from the file and one for storing in the array. The former describes the content of the file and the latter will be used if I extract a byte-string from the array and pass it to any Python API. > No latin1 de/encoding is required for anything, I don't know why you would > want do to that in this context. > Does opening latin1 files even work with current loadtxt? It's the only encoding that works for dtype='S'. > It currently uses UTF-8 which is to my knowledge not compatible with latin1. It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and store in the array, corrupting any non-ascii characters. Here's a demonstration: $ ipython3 Python 3.2.3 (default, Sep 25 2013, 18:22:43) Type "copyright", "credits" or "license" for more information. IPython 0.12.1 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. 
In [1]: with open('?scar.txt', 'w') as fout: pass In [2]: import os In [3]: os.listdir('.') Out[3]: ['?scar.txt'] In [4]: with open('filenames.txt', 'w') as fout: ...: fout.writelines([f + '\n' for f in os.listdir('.')]) ...: In [5]: with open('filenames.txt') as fin: ...: print(fin.read()) ...: filenames.txt ?scar.txt In [6]: import numpy In [7]: filenames = numpy.loadtxt('filenames.txt') ValueError: could not convert string to float: b'filenames.txt' In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S') In [9]: filenames Out[9]: array([b'filenames.txt', b'\xd5scar.txt'], dtype='|S13') In [10]: open(filenames[1]) --------------------------------------------------------------------------- IOError Traceback (most recent call last) /users/enojb/.rcs/tmp/ in () ----> 1 open(filenames[1]) IOError: [Errno 2] No such file or directory: '\udcd5scar.txt' In [11]: open('?scar.txt'.encode('utf-8')) Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'> Oscar From josef.pktd at gmail.com Fri Jan 17 09:11:22 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 09:11:22 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140117134033.GB2253@gmail.com> References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 8:40 AM, Oscar Benjamin wrote: > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: >> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin >> wrote: >> >> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: >> > > Julian Taylor googlemail.com> writes: >> > > [clip] >> > >> >> > > > For backward compatibility we *cannot* change S. >> > >> > Do you mean to say that loadtxt cannot be changed from decoding using >> > system >> > default, splitting on newlines and whitespace and then encoding the >> > substrings >> > as latin-1? >> > >> >> unicode dtypes have nothing to do with the loadtxt issue. They are not >> related. > > I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier, > if the file is not encoded as ascii or latin-1 then the byte strings are > corrupted (see below). > > This is because loadtxt opens the file with the default system encoding (by > not explicitly specifying an encoding): > https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732 > > It then processes each line with asbytes() which encodes them as latin-1: > https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784 > https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28 > > Being an English speaker I don't normally use non-ascii characters in > filenames but my system (Ubuntu Linux) still uses utf-8 rather than latin-1 or > (and rightly so!). > >> > >> > An obvious improvement would be along the lines of what Chris Barker >> > suggested: decode as latin-1, do the processing and then reencode as >> > latin-1. >> > >> >> no, the right solution is to add an encoding argument. >> Its a 4 line patch for python2 and a 2 line patch for python3 and the issue >> is solved, I'll file a PR later. > > What is the encoding argument for? Is it to be used to decode, process the > text and then re-encode it for an array with dtype='S'? > > Note that there are two encodings: one for reading from the file and one for > storing in the array. 
The former describes the content of the file and the > latter will be used if I extract a byte-string from the array and pass it to > any Python API. > >> No latin1 de/encoding is required for anything, I don't know why you would >> want do to that in this context. >> Does opening latin1 files even work with current loadtxt? > > It's the only encoding that works for dtype='S'. > >> It currently uses UTF-8 which is to my knowledge not compatible with latin1. > > It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and > store in the array, corrupting any non-ascii characters. Here's a > demonstration: > > $ ipython3 > Python 3.2.3 (default, Sep 25 2013, 18:22:43) > Type "copyright", "credits" or "license" for more information. > > IPython 0.12.1 -- An enhanced Interactive Python. > ? -> Introduction and overview of IPython's features. > %quickref -> Quick reference. > help -> Python's own help system. > object? -> Details about 'object', use 'object??' for extra details. > > In [1]: with open('?scar.txt', 'w') as fout: pass > > In [2]: import os > > In [3]: os.listdir('.') > Out[3]: ['?scar.txt'] > > In [4]: with open('filenames.txt', 'w') as fout: > ...: fout.writelines([f + '\n' for f in os.listdir('.')]) > ...: > > In [5]: with open('filenames.txt') as fin: > ...: print(fin.read()) > ...: > filenames.txt > ?scar.txt > > > In [6]: import numpy > > In [7]: filenames = numpy.loadtxt('filenames.txt') > > ValueError: could not convert string to float: b'filenames.txt' > > In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S') > > In [9]: filenames > Out[9]: > array([b'filenames.txt', b'\xd5scar.txt'], > dtype='|S13') > > In [10]: open(filenames[1]) > --------------------------------------------------------------------------- > IOError Traceback (most recent call last) > /users/enojb/.rcs/tmp/ in () > ----> 1 open(filenames[1]) > > IOError: [Errno 2] No such file or directory: '\udcd5scar.txt' > > In [11]: open('?scar.txt'.encode('utf-8')) > Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'> Windows seems to use consistent en/decoding throughout (example run in IDLE) Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32 >>> filenames = numpy.loadtxt('filenames.txt', dtype='S') >>> filenames array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py', b'\xd5scar.txt'], dtype='|S18') >>> fn = open(filenames[-1]) >>> fn.read() '1,2,3,hello\n5,6,7,?scar\n' >>> fn <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'> Josef > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jtaylor.debian at googlemail.com Fri Jan 17 09:12:32 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 17 Jan 2014 15:12:32 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140117134033.GB2253@gmail.com> References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin wrote: > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: > > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin > > wrote: > > > > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: > > > > Julian Taylor googlemail.com> writes: > > > > [clip] > > > > > > > > > > For 
backward compatibility we *cannot* change S. > > > > > > Do you mean to say that loadtxt cannot be changed from decoding using > > > system > > > default, splitting on newlines and whitespace and then encoding the > > > substrings > > > as latin-1? > > > > > > > unicode dtypes have nothing to do with the loadtxt issue. They are not > > related. > > I'm talking about what loadtxt does with the 'S' dtype. As I showed > earlier, > if the file is not encoded as ascii or latin-1 then the byte strings are > corrupted (see below). > > This is because loadtxt opens the file with the default system encoding (by > not explicitly specifying an encoding): > https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732 > > It then processes each line with asbytes() which encodes them as latin-1: > https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784 > https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28 > wow this is just horrible, it might be the source of the bug. > > Being an English speaker I don't normally use non-ascii characters in > filenames but my system (Ubuntu Linux) still uses utf-8 rather than > latin-1 or > (and rightly so!). > > > > > > > An obvious improvement would be along the lines of what Chris Barker > > > suggested: decode as latin-1, do the processing and then reencode as > > > latin-1. > > > > > > > no, the right solution is to add an encoding argument. > > Its a 4 line patch for python2 and a 2 line patch for python3 and the > issue > > is solved, I'll file a PR later. > > What is the encoding argument for? Is it to be used to decode, process the > text and then re-encode it for an array with dtype='S'? > it is only used to decode the file into text, nothing more. loadtxt is supposed to load text files, it should never have to deal with bytes ever. But I haven't looked into the function deeply yet, there might be ugly surprises. The output of the array is determined by the dtype argument and not by the encoding argument. Lets please let the loadtxt issue go to rest. We know the issue, we know it can be fixed without adding anything complicated to numpy. We just have to use what python already provides us. The technical details of the fix can be discussed in the github issue. (Plan to have a look this weekend, but if someone else wants to do it let me know). -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Fri Jan 17 10:26:05 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Fri, 17 Jan 2014 15:26:05 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote: > On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin > wrote: > > > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: > > > > > > no, the right solution is to add an encoding argument. > > > Its a 4 line patch for python2 and a 2 line patch for python3 and the > > issue > > > is solved, I'll file a PR later. > > > > What is the encoding argument for? Is it to be used to decode, process the > > text and then re-encode it for an array with dtype='S'? > > > > it is only used to decode the file into text, nothing more. > loadtxt is supposed to load text files, it should never have to deal with > bytes ever. 
> But I haven't looked into the function deeply yet, there might be ugly > surprises. > > The output of the array is determined by the dtype argument and not by the > encoding argument. If the dtype is 'S' then the output should be bytes and you therefore need to encode the text; there's no such thing as storing text in bytes without an encoding. Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32' which just happens to be as simple as expressing the corresponding unicode code points as int32 so it's reasonable to think of it as "not encoded" in some sense (although endianness becomes an issue in utf-32). On 17 January 2014 14:11, wrote: > Windows seems to use consistent en/decoding throughout (example run in IDLE) The reason for the Py3k bytes/text overhaul is that there were lots of situations where things *seemed* to work until someone happens to use a character you didn't try. "Seems to" doesn't cut it! :) > Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 > 32 bit (Intel)] on win32 > >>>> filenames = numpy.loadtxt('filenames.txt', dtype='S') >>>> filenames > array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py', > b'\xd5scar.txt'], > dtype='|S18') >>>> fn = open(filenames[-1]) >>>> fn.read() > '1,2,3,hello\n5,6,7,?scar\n' >>>> fn > <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'> You don't show how you created the file. I think that in your case the content of 'filenames.txt' is correctly encoded latin-1. My guess is that you did the same as me and opened it in text mode and wrote the unicode string allowing Python to encode it for you. Judging by the encoding on fn above I'd say that it wrote the file with cp1252 which is mostly compatible with latin-1. Try it with a byte that is incompatible between cp1252 and latin-1 e.g.: In [3]: b'\x80'.decode('cp1252') Out[3]: '?' In [4]: b'\x80'.decode('latin-1') Out[4]: '\x80' In [5]: b'\x80'.decode('cp1252').encode('latin-1') --------------------------------------------------------------------------- UnicodeEncodeError Traceback (most recent call last) /users/enojb/ in () ----> 1 b'\x80'.decode('cp1252').encode('latin-1') UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256) Oscar From josef.pktd at gmail.com Fri Jan 17 10:58:25 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 10:58:25 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin wrote: > On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote: >> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin >> wrote: >> >> > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: >> > > >> > > no, the right solution is to add an encoding argument. >> > > Its a 4 line patch for python2 and a 2 line patch for python3 and the >> > issue >> > > is solved, I'll file a PR later. >> > >> > What is the encoding argument for? Is it to be used to decode, process the >> > text and then re-encode it for an array with dtype='S'? >> > >> >> it is only used to decode the file into text, nothing more. >> loadtxt is supposed to load text files, it should never have to deal with >> bytes ever. >> But I haven't looked into the function deeply yet, there might be ugly >> surprises. 
>> >> The output of the array is determined by the dtype argument and not by the >> encoding argument. > > If the dtype is 'S' then the output should be bytes and you therefore > need to encode the text; there's no such thing as storing text in > bytes without an encoding. > > Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32' > which just happens to be as simple as expressing the corresponding > unicode code points as int32 so it's reasonable to think of it as "not > encoded" in some sense (although endianness becomes an issue in > utf-32). > > On 17 January 2014 14:11, wrote: >> Windows seems to use consistent en/decoding throughout (example run in IDLE) > > The reason for the Py3k bytes/text overhaul is that there were lots of > situations where things *seemed* to work until someone happens to use > a character you didn't try. "Seems to" doesn't cut it! :) > >> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 >> 32 bit (Intel)] on win32 >> >>>>> filenames = numpy.loadtxt('filenames.txt', dtype='S') >>>>> filenames >> array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py', >> b'\xd5scar.txt'], >> dtype='|S18') >>>>> fn = open(filenames[-1]) >>>>> fn.read() >> '1,2,3,hello\n5,6,7,?scar\n' >>>>> fn >> <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'> > > You don't show how you created the file. I think that in your case the > content of 'filenames.txt' is correctly encoded latin-1. I had created it with os.listdir but deleted some lines Running the full script again I still get the same correct answer for fn ------------ import os if 1: with open('filenames5.txt', 'w') as fout: fout.writelines([f + '\n' for f in os.listdir('.')]) with open('filenames.txt') as fin: print(fin.read()) import numpy #filenames = numpy.loadtxt('filenames.txt') filenames = numpy.loadtxt('filenames5.txt', dtype='S') fn = open(filenames[-1]) ------------ > > My guess is that you did the same as me and opened it in text mode and > wrote the unicode string allowing Python to encode it for you. Judging > by the encoding on fn above I'd say that it wrote the file with cp1252 > which is mostly compatible with latin-1. Try it with a byte that is > incompatible between cp1252 and latin-1 e.g.: > > In [3]: b'\x80'.decode('cp1252') > Out[3]: '?' > > In [4]: b'\x80'.decode('latin-1') > Out[4]: '\x80' > > In [5]: b'\x80'.decode('cp1252').encode('latin-1') > --------------------------------------------------------------------------- > UnicodeEncodeError Traceback (most recent call last) > /users/enojb/ in () > ----> 1 b'\x80'.decode('cp1252').encode('latin-1') > > UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in > position 0: ordinal not in range(256) I get similar problems when I use a file that someone else has written, however I haven't seen much problems if I do everything on Windows. The main problems I get and where I don't know how it's supposed to work in the best way is when we get "foreign" data. 
some examples I just played with that are closer to what we use in statsmodels but don't have any unit tests >>> filenames1 = numpy.recfromtxt(open('?scar.txt',"rb"), delimiter=',') >>> filenames1 rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xd5scar')], dtype=[('f0', '>> filenames1['f3'][-1] b'\xd5scar' >>> filenames1['f3'] == '?scar' False >>> filenames1['f3'] == '?scar'.encode('cp1252') array([False, True], dtype=bool) >>> filenames1['f3'] == 'hello' False >>> filenames1['f3'] == b'hello' array([ True, False], dtype=bool) >>> filenames1['f3'] == b'\xd5scar' array([False, True], dtype=bool) >>> filenames1['f3'] == np.array(['?scar'.encode('utf8')], 'S5') array([False, False], dtype=bool) Josef > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From oscar.j.benjamin at gmail.com Fri Jan 17 12:13:05 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Fri, 17 Jan 2014 17:13:05 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> Message-ID: <20140117171301.GA4168@gmail.com> On Fri, Jan 17, 2014 at 10:58:25AM -0500, josef.pktd at gmail.com wrote: > On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin > wrote: > > On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote: > > > > You don't show how you created the file. I think that in your case the > > content of 'filenames.txt' is correctly encoded latin-1. > > I had created it with os.listdir but deleted some lines You used os.listdir to generate the unicode strings that you write to the file. The underlying Win32 API returns filenames encoded as utf-16 but Python takes care of decoding them under the hood so you just get abstract unicode strings here in Python 3. It is the write method of the file object that encodes the unicode strings and hence determines the byte content of 'filenames5.txt'. You can check the fout.encoding attribute to see what encoding it uses by default. > Running the full script again I still get the same correct answer for fn > ------------ > import os > if 1: > with open('filenames5.txt', 'w') as fout: > fout.writelines([f + '\n' for f in os.listdir('.')]) > with open('filenames.txt') as fin: > print(fin.read()) > > import numpy > > #filenames = numpy.loadtxt('filenames.txt') > filenames = numpy.loadtxt('filenames5.txt', dtype='S') > fn = open(filenames[-1]) The question is what do you get when you do: In [1]: with open('tmp.txt', 'w') as fout: ...: print(fout.encoding) ...: UTF-8 I get utf-8 by default if no encoding is specified. This means that when I write to the file like so In [2]: with open('tmp.txt', 'w') as fout: ...: fout.write('?scar') ...: If I read it back in binary I get different bytes from you: In [3]: with open('tmp.txt', 'rb') as fin: ...: print(fin.read()) ...: b'\xc3\x95scar' Numpy.loadtxt will correctly decode those bytes as utf-8: In [5]: b'\xc3\x95scar'.decode('utf-8') Out[5]: '?scar' But then it reencodes them with latin-1 before storing them in the array: In [6]: b'\xc3\x95scar'.decode('utf-8').encode('latin-1') Out[6]: b'\xd5scar' This byte string will not be recognised by my Linux OS (POSIX uses bytes for filenames and an exact match is needed). So if I pass that to open() it will fail. 
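The corrupting step can be reproduced in isolation, without loadtxt; a minimal
sketch (same invented filename as above, assuming a utf-8 locale as in the
session shown earlier):

original = 'Õscar.txt'.encode('utf-8')   # the bytes on disk / in filenames.txt: b'\xc3\x95scar.txt'
decoded = original.decode('utf-8')       # decoded with the (utf-8) system default: 'Õscar.txt'
stored = decoded.encode('latin-1')       # re-encoded a la asbytes(): b'\xd5scar.txt'
assert stored != original                # different bytes, so open(stored) cannot match the file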
> > I get similar problems when I use a file that someone else has > written, however I haven't seen much problems if I do everything on > Windows. If you use a proper explicit encoding then you can savetxt from any system and loadtxt on any other without corruption. > The main problems I get and where I don't know how it's supposed to > work in the best way is when we get "foreign" data. Text data needs to have metadata specifying the encoding. This is something that people who pass data around need to think about. Oscar From pav at iki.fi Fri Jan 17 13:40:41 2014 From: pav at iki.fi (Pauli Virtanen) Date: Fri, 17 Jan 2014 20:40:41 +0200 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: 17.01.2014 15:09, Aldcroft, Thomas kirjoitti: [clip] > I've been playing around with porting a stack of analysis libraries > to Python 3 and this is a very timely thread and comment. What I > discovered right away is that all the string data coming from > binary HDF5 files show up (as expected) as 'S' type,, but that > trying to make everything actually work in Python 3 without > converting to 'U' is a big mess of whack-a-mole. > > Yes, it's possible to change my libraries to use bytestring > literals everywhere, but the Python 3 user experience becomes > horrible because to interact with the data all downstream > applications need to use bytestring literals everywhere. E.g. > doing a simple filter like `string_array == 'foo'` doesn't work, > and this will break all existing code when trying to run in Python > 3. And every time you try to print something it has this horrible > "b" in front. Ugly, and it just won't work well in the end. [clip] Ok, I see your point. Having additional Unicode data types with smaller widths could be useful. On Python 2, they would then be Unicode strings, right? Thanks to Py2 automatic Unicode encoding/decoding, they might also be usable in interactive etc. use on Py2. Adding new data types in Numpy codebase takes some work, but it's possible to do. There's also an issue (as noted in the Github ticket) that array([u'foo'], dtype=bytes) encodes silently via the ASCII codec. This is probably not how it should be. -- Pauli Virtanen From jtaylor.debian at googlemail.com Fri Jan 17 14:18:47 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 17 Jan 2014 20:18:47 +0100 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> Message-ID: <52D98217.7020000@googlemail.com> On 17.01.2014 15:12, Julian Taylor wrote: > On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin > > wrote: > > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: > > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin > > >wrote: > > > > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: > > > > Julian Taylor googlemail.com > > writes: > > > > [clip] > > > > > > > > > > For backward compatibility we *cannot* change S. > > > > > > Do you mean to say that loadtxt cannot be changed from decoding > using > > > system > > > default, splitting on newlines and whitespace and then encoding the > > > substrings > > > as latin-1? > > > > > > > unicode dtypes have nothing to do with the loadtxt issue. They are not > > related. 
> > I'm talking about what loadtxt does with the 'S' dtype. As I showed > earlier, > if the file is not encoded as ascii or latin-1 then the byte strings are > corrupted (see below). > > This is because loadtxt opens the file with the default system > encoding (by > not explicitly specifying an encoding): > https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732 > > It then processes each line with asbytes() which encodes them as > latin-1: > https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784 > https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28 > > > > wow this is just horrible, it might be the source of the bug. > > > > > Being an English speaker I don't normally use non-ascii characters in > filenames but my system (Ubuntu Linux) still uses utf-8 rather than > latin-1 or > (and rightly so!). > > > > > > > An obvious improvement would be along the lines of what Chris Barker > > > suggested: decode as latin-1, do the processing and then reencode as > > > latin-1. > > > > > > > no, the right solution is to add an encoding argument. > > Its a 4 line patch for python2 and a 2 line patch for python3 and > the issue > > is solved, I'll file a PR later. > > What is the encoding argument for? Is it to be used to decode, > process the > text and then re-encode it for an array with dtype='S'? > > > it is only used to decode the file into text, nothing more. > loadtxt is supposed to load text files, it should never have to deal > with bytes ever. > But I haven't looked into the function deeply yet, there might be ugly > surprises. > > The output of the array is determined by the dtype argument and not by > the encoding argument. > > Lets please let the loadtxt issue go to rest. > We know the issue, we know it can be fixed without adding anything > complicated to numpy. > We just have to use what python already provides us. > The technical details of the fix can be discussed in the github issue. > (Plan to have a look this weekend, but if someone else wants to do it > let me know). > Work in progress PR: https://github.com/numpy/numpy/pull/4208 I also seem to have fixed the original bug, while wasn't even my intention with that PR :) apparently it was indeed one of the broken asbytes calls. if you have applications using loadtxt please give it a try, but genfromtxt is still completely broken (and a much larger fix, asbytes everywhere) From josef.pktd at gmail.com Fri Jan 17 14:58:21 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 14:58:21 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <52D98217.7020000@googlemail.com> References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: On Fri, Jan 17, 2014 at 2:18 PM, Julian Taylor wrote: > On 17.01.2014 15:12, Julian Taylor wrote: >> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin >> > wrote: >> >> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote: >> > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin >> > >wrote: >> > >> > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote: >> > > > Julian Taylor googlemail.com >> > writes: >> > > > [clip] >> > > >> > >> > > > > For backward compatibility we *cannot* change S. 
>> > > >> > > Do you mean to say that loadtxt cannot be changed from decoding >> using >> > > system >> > > default, splitting on newlines and whitespace and then encoding the >> > > substrings >> > > as latin-1? >> > > >> > >> > unicode dtypes have nothing to do with the loadtxt issue. They are not >> > related. >> >> I'm talking about what loadtxt does with the 'S' dtype. As I showed >> earlier, >> if the file is not encoded as ascii or latin-1 then the byte strings are >> corrupted (see below). >> >> This is because loadtxt opens the file with the default system >> encoding (by >> not explicitly specifying an encoding): >> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732 >> >> It then processes each line with asbytes() which encodes them as >> latin-1: >> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784 >> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28 >> >> >> >> wow this is just horrible, it might be the source of the bug. >> >> >> >> >> Being an English speaker I don't normally use non-ascii characters in >> filenames but my system (Ubuntu Linux) still uses utf-8 rather than >> latin-1 or >> (and rightly so!). >> >> > > >> > > An obvious improvement would be along the lines of what Chris Barker >> > > suggested: decode as latin-1, do the processing and then reencode as >> > > latin-1. >> > > >> > >> > no, the right solution is to add an encoding argument. >> > Its a 4 line patch for python2 and a 2 line patch for python3 and >> the issue >> > is solved, I'll file a PR later. >> >> What is the encoding argument for? Is it to be used to decode, >> process the >> text and then re-encode it for an array with dtype='S'? >> >> >> it is only used to decode the file into text, nothing more. >> loadtxt is supposed to load text files, it should never have to deal >> with bytes ever. >> But I haven't looked into the function deeply yet, there might be ugly >> surprises. >> >> The output of the array is determined by the dtype argument and not by >> the encoding argument. >> >> Lets please let the loadtxt issue go to rest. >> We know the issue, we know it can be fixed without adding anything >> complicated to numpy. >> We just have to use what python already provides us. >> The technical details of the fix can be discussed in the github issue. >> (Plan to have a look this weekend, but if someone else wants to do it >> let me know). >> > > Work in progress PR: > https://github.com/numpy/numpy/pull/4208 > > I also seem to have fixed the original bug, while wasn't even my > intention with that PR :) > apparently it was indeed one of the broken asbytes calls. > > if you have applications using loadtxt please give it a try, but > genfromtxt is still completely broken (and a much larger fix, asbytes > everywhere) does this still work? 
>>> numpy.loadtxt(open('?scar_3.txt',"rb"), 'S') array([b'1,2,3,hello', b'5,6,7,\xc3\x95scarscar', b'15,2,3,hello', b'20,2,3,\xc3\x95scar'], dtype='|S16') to compare >>> numpy.recfromtxt(open('?scar_3.txt',"r", encoding='utf8'), delimiter=',') Traceback (most recent call last): File "", line 1, in numpy.recfromtxt(open('?scar_3.txt',"r", encoding='utf8'), delimiter=',') File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1828, in recfromtxt output = genfromtxt(fname, **kwargs) File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1351, in genfromtxt first_values = split_line(first_line) File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py", line 207, in _delimited_splitter line = line.split(self.comments)[0] TypeError: Can't convert 'bytes' object to str implicitly >>> numpy.recfromtxt(open('?scar_3.txt',"rb"), delimiter=',') rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'), (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')], dtype=[('f0', ' _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From chris.barker at noaa.gov Fri Jan 17 15:02:52 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 12:02:52 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> Message-ID: On Fri, Jan 17, 2014 at 1:38 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > This thread is getting a little out of hand which is my fault for > initially mixing different topics in one mail, > still a bit mixed ;-) -- but I think the loadtxt issue requires a lot less discussion, so we're OK there. There have been a lot of notes here since I last commented, so I'm going stick with the loadtxt issues in this note: - no possibility to specify the encoding of a file in loadtxt > this is a missing feature, currently it uses the system default which is > good and should stay that way. > I disagree -- I think using the "system encoding" is a bad idea for a default -- I certainly am far more likely to get data files from some other system than my own -- and really unlikely to use the "system encoding" for any data files I write, either. And I'm not begin english-centered here -- my data files commonly do have non-ascii code in there, though frankly, they are either a mess or I know the encoding. What should be the default? latin-1 Why? Despite our desire to be non-english-focuses, most of what loadtxt does is parse files for numbers, maybe with a bit of text. Numbers are virtually always ascii-compatible (am I wrong about that? -- if so you'd damn well better know your encoding!). So it should be an ascii-compatible encoding. Why not ascii? -- because then it would barf on non-ascii text in the file -- really bad idea there. Why not utf-8 -- this is being *nic centric -- and utf-8 will wrk fine on ascii, but corrupt non-asci,, non-utf-8 data (i.e. any other encoding.) and may barf on some of ti too (not sure about that). latin-1 will never barf on any binary data, will successfully parse any numeric data (plus spaces, commas, etc.), and will preserve the bytes of an non-ascii content in the file. If you can set the encoding it's not a huge deal what the default is, but I will recommend that everyone always either sets it to a known encoding or uses latin-1 -- never the system encoding. 
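That property is easy to check; a minimal sketch (the byte blob is arbitrary,
chosen only to contain every possible byte value):

blob = bytes(range(256))               # arbitrary binary junk, all 256 byte values
text = blob.decode('latin-1')          # never raises: latin-1 maps each byte to a code point
assert text.encode('latin-1') == blob  # and the original bytes round-trip exactly
try:
    blob.decode('ascii')               # ascii (and utf-8) reject many byte sequences
except UnicodeDecodeError as e:
    print('ascii cannot decode it:', e)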
One more point: on my system right now: In [15]: sys.getdefaultencoding() Out[15]: 'ascii' please don't make loadttxt start barfing on files I've been reading just fine for years.... It is only missing an option to tell it to treat it differently. > There should be little debate about changing the default, especially not > using latin1. The system default exists for a good reason. > Maybe, maybe not, but I submit that whatever that "good reason" is, it does not apply here! This is kin dof like datetime64 using the localle timezone -- makes it useless! > Note on linux it is UTF-8 which is a good choice. I'm not familiar with > windows but all programs should at least have the option to use UTF-8 as > output too. > should, yes, so, maybe, but: a) not all text data files are written recently or by recently updated software. b) This is kind of like saying we should have loadtxt default to utf-8, which wouldn't be the worst idea -- better than system default, but still not as good as latin-1 This is a simple question: Should the exact same file read fine with the exact same code on one machine, but not another? I don't think so. This has nothing to do with indexing or any kind of processing of the numpy > arrays. > agreed. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 17 15:17:58 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 12:17:58 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: >>> numpy.recfromtxt(open('?scar_3.txt',"r", encoding='utf8'), delimiter=',') > Traceback (most recent call last): > File "", line 1, in > numpy.recfromtxt(open('?scar_3.txt',"r", encoding='utf8'), > delimiter=',') > File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", > line 1828, in recfromtxt > output = genfromtxt(fname, **kwargs) > File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", > line 1351, in genfromtxt > first_values = split_line(first_line) > File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py", > line 207, in _delimited_splitter > line = line.split(self.comments)[0] > TypeError: Can't convert 'bytes' object to str implicitly > That's pretty broken -- if you know the encoding, you should certainly be able to get a proper unicode string out of it.. > >>> numpy.recfromtxt(open('?scar_3.txt',"rb"), delimiter=',') > rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'), > (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')], > dtype=[('f0', ' So the problem here is that recfromtxt is making all "text" bytes objects. ('S' ?) -- which is probably not what you want particularly if you specify an encoding. Though I can't figure out at the moment why the previous one failed -- where did the bytes object come from when the encoding was specified? By the way -- this is apparently a utf-file with some non-ascii text in it. By my proposal, without an encoding specified, it should default to latin-1: In that case, you might get unicode string objects that are incorrectly decoded. 
But: it would not raise an exception you could recover the proper text with: the_text.encode(latin-1).decode('utf-8') On the other hand, if this was as ascii-compatible non-utf8 encoding file, and we tried to read it as utf-8, it could barf on the non-ascii text altogether, and if it didn't the non-ascii text would be corrupted and impossible to recover. I think the issue is that I'm not really proposing latin-1 -- I'm proposing "a ascii compatible encoding that will do the right thing with ascii bytes, and pass through any other bytes untouched" - latin-1, at least as implemented by Python, satisfies that criterium. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 17 15:30:06 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 12:30:06 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <52D92DAE.5020409@witherden.org> References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> Message-ID: On Fri, Jan 17, 2014 at 5:18 AM, Freddie Witherden wrote: > In terms of HDF5 it is interesting to look at how h5py -- which has to > go between NumPy types and HDF5 conventions -- handles the problem as > described here: > > http://www.h5py.org/docs/topics/strings.html from that: """All strings in HDF5 hold encoded text. You can?t store arbitrary binary data in HDF5 strings. """ This is actually the same as a py3 string (though the mechanism may be completely different), and the problem with numpy's 'S' - is it text or bytes? Given the name and history, it should be text, but apparently people have been using t for bytes, so we have to keep that meaning/use case. But I suggest, that like Python3 -- we official declare that you should not consider it text, and not do any implicite conversions. Which means we could use a one-byte-per-character text dtype. """At the high-level interface, h5py exposes three kinds of strings. Each maps to a specific type within Python (but see str_py3 below): Fixed-length ASCII (NumPy S type) .... """ This is wrong, or mis-guided, or maybe only a little confusing -- 'S' is not an ASCII string (even though I wish it were...). But clearly the HDF folsk think we need one! """ Fixed-length ASCII These are created when you use numpy.string_: >>> dset.attrs["name"] = numpy.string_("Hello") or the S dtype: >>> dset = f.create_dataset("string_ds", (100,), dtype="S10") """ Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? Form another post, I thought you'd need to use numpy.bytes_ (which is the same on py2) """Variable-length ASCII These are created when you assign a byte string to an attribute: >>> dset.attrs["attr"] = b"Hello" or when you create a dataset with an explicit ?bytes? vlen type: >>> dt = h5py.special_dtype(vlen=bytes) >>> dset = f.create_dataset("name", (100,), dtype=dt) Note that they?re not fully identical to Python byte strings. """ This implies that HDF would be well served by an ascii text type. """ What about NumPy?s U type? NumPy also has a Unicode type, a UTF-32 fixed-width format (4-byte characters). HDF5 has no support for wide characters. Rather than trying to hack around this and ?pretend? 
to support it, h5py will raise an error when attempting to create datasets or attributes of this type. """ Interesting, though I think irrelevant to this conversation but it would be nice if HDFpy would encode/decode to/from utf-8 for these. -Chris > which IMHO got it about right. > > Regards, Freddie. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Fri Jan 17 15:36:12 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 15:36:12 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: On Fri, Jan 17, 2014 at 3:17 PM, Chris Barker wrote: > >>> numpy.recfromtxt(open('?scar_3.txt',"r", encoding='utf8'), > delimiter=',') >> >> Traceback (most recent call last): >> File "", line 1, in >> numpy.recfromtxt(open('?scar_3.txt',"r", encoding='utf8'), >> delimiter=',') >> File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", >> line 1828, in recfromtxt >> output = genfromtxt(fname, **kwargs) >> File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", >> line 1351, in genfromtxt >> first_values = split_line(first_line) >> File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py", >> line 207, in _delimited_splitter >> line = line.split(self.comments)[0] >> TypeError: Can't convert 'bytes' object to str implicitly > > > That's pretty broken -- if you know the encoding, you should certainly be > able to get a proper unicode string out of it.. > >> >> >>> numpy.recfromtxt(open('?scar_3.txt',"rb"), delimiter=',') >> rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'), >> (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')], >> dtype=[('f0', ' > > So the problem here is that recfromtxt is making all "text" bytes objects. > ('S' ?) -- which is probably not what you want particularly if you specify > an encoding. Though I can't figure out at the moment why the previous one > failed -- where did the bytes object come from when the encoding was > specified? Yes, it's a utf-8 file with nonascii. I don't know what I **should** want. For now I do want bytes, because that's how I changed statsmodels in the py3 conversion. This was just based on the fact that recfromtxt doesn't work with strings on python 3, so I switched to using bytes following the lead of numpy. I'm mainly worried about backwards compatibility, since we have been using this for 2 or 3 years. It would be easy to change in statsmodels when gen/recfromtxt is fixed, but I assume there is lots of other code using similar interpretation of S/bytes in numpy. Josef > > By the way -- this is apparently a utf-file with some non-ascii text in it. > By my proposal, without an encoding specified, it should default to latin-1: > > In that case, you might get unicode string objects that are incorrectly > decoded. 
But: > > it would not raise an exception > > you could recover the proper text with: > > the_text.encode(latin-1).decode('utf-8') > > On the other hand, if this was as ascii-compatible non-utf8 encoding file, > and we tried to read it as utf-8, it could barf on the non-ascii text > altogether, and if it didn't the non-ascii text would be corrupted and > impossible to recover. > > I think the issue is that I'm not really proposing latin-1 -- I'm proposing > "a ascii compatible encoding that will do the right thing with ascii bytes, > and pass through any other bytes untouched" - latin-1, at least as > implemented by Python, satisfies that criterium. > > -Chris > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From chris.barker at noaa.gov Fri Jan 17 15:56:42 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 12:56:42 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> Message-ID: Small note: Being an English speaker I don't normally use non-ascii characters in > filenames but my system (Ubuntu Linux) still uses utf-8 rather than > latin-1 or > (and rightly so!). just to be really clear -- encoding for filenames and encoding for file content have nothing to do with each-other. sys.getdefaultencoding() is _supposed_ to be a default encoding for file content -- not file names. And of course you need to use the system file name encoding for file names! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 17 16:20:39 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 13:20:39 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: On Fri, Jan 17, 2014 at 12:36 PM, wrote: > > ('S' ?) -- which is probably not what you want particularly if you > specify > > an encoding. Though I can't figure out at the moment why the previous one > > failed -- where did the bytes object come from when the encoding was > > specified? > > Yes, it's a utf-8 file with nonascii. > > I don't know what I **should** want. > well, you **should** want: The numbers parsed out for you (Other wise, why use recfromtxt), and the text as properly decoded unicode strings. Python does very well with unicode -- and you are MUCH happier if you do the encoding/decoding as close to I/O as possible. recfromtxt is, in a way, decoding already, converting ascii representation of numbers to an internal binary representation -- why not handle the text at the same time. 
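In code the pattern looks something like this (a sketch with invented data;
io.StringIO stands in for a real file opened with an explicit encoding):

import io
f = io.StringIO('1,2.5,hello\n5,6.5,Õscar\n')   # like open(path, encoding='utf-8')
rows = [line.rstrip('\n').split(',') for line in f]
parsed = [(int(a), float(b), name) for a, b, name in rows]
print(parsed)   # [(1, 2.5, 'hello'), (5, 6.5, 'Õscar')] -- text stays ordinary str

The numbers get parsed and the text stays unicode throughout; the encoding is
dealt with exactly once, at the I/O boundary.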
There certainly are use cases for keeping the text as encoded bytes, but I'd say those fall into the categories of: 1) Special case 2) You should know what you are doing. So having recfromtxt auto-determine that for you makes little sense. Note that if you don't know the file encoding, this is tricky. My thoughts: 1) don't use the system default encoding!!! (see my other note on that!) 2) Either: a) open as a binary file and use bytes for anything that doesn't parse as text -- this means that the user will need to do the conversion to text themselves b) decode as latin-1: this would work well for ascii and _some_ non-ascii text, and would be recoverable for ALL text. I prefer (b). The point here is that if the user gets bytes, then they will either have to assume ascii, or need to hand-decode it, and if they just want assume ascii, they have a bytes object with limited text functionality so will probably need to decode it anyway (unless they are just passing it through) If the user gets unicode objects that are may not properly decoded, they can either assume it was ascii, and if they only do ascii-compatible things with it, it will work, or they can encode/decode it and get the proper stuff back, but only if they know the encoding, and if that's the case, why did they not specify that in the first place? > For now I do want bytes, because that's how I changed statsmodels in > the py3 conversion. > > This was just based on the fact that recfromtxt doesn't work with > strings on python 3, so I switched to using bytes following the lead > of numpy. > Well, that's really too bad -- it doesn't sound like you wanted bytes, it sounds like you wanted something that didn't crash -- fair enough. But the "proper" solution is for recfromtext to support text.... I'm mainly worried about backwards compatibility, since we have been > using this for 2 or 3 years. It would be easy to change in statsmodels > when gen/recfromtxt is fixed, but I assume there is lots of other code > using similar interpretation of S/bytes in numpy. > well, it does sound like enough folks are using 'S' to mean bytes -- too bad, but what can we do now about that? I'd like a 's' for ascii-stings though. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Fri Jan 17 16:43:58 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 16:43:58 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: On Fri, Jan 17, 2014 at 4:20 PM, Chris Barker wrote: > On Fri, Jan 17, 2014 at 12:36 PM, wrote: >> >> > ('S' ?) -- which is probably not what you want particularly if you >> > specify >> > an encoding. Though I can't figure out at the moment why the previous >> > one >> > failed -- where did the bytes object come from when the encoding was >> > specified? >> >> Yes, it's a utf-8 file with nonascii. >> >> I don't know what I **should** want. > > > well, you **should** want: > > The numbers parsed out for you (Other wise, why use recfromtxt), and the > text as properly decoded unicode strings. 
> > Python does very well with unicode -- and you are MUCH happier if you do the > encoding/decoding as close to I/O as possible. recfromtxt is, in a way, > decoding already, converting ascii representation of numbers to an internal > binary representation -- why not handle the text at the same time. > > There certainly are use cases for keeping the text as encoded bytes, but I'd > say those fall into the categories of: > > 1) Special case > 2) You should know what you are doing. > > So having recfromtxt auto-determine that for you makes little sense. > > Note that if you don't know the file encoding, this is tricky. My thoughts: > > 1) don't use the system default encoding!!! (see my other note on that!) > > 2) Either: > a) open as a binary file and use bytes for anything that doesn't parse > as text -- this means that the user will need to do the conversion to text > themselves > > b) decode as latin-1: this would work well for ascii and _some_ non-ascii > text, and would be recoverable for ALL text. > > I prefer (b). The point here is that if the user gets bytes, then they will > either have to assume ascii, or need to hand-decode it, and if they just > want assume ascii, they have a bytes object with limited text functionality > so will probably need to decode it anyway (unless they are just passing it > through) > > If the user gets unicode objects that are may not properly decoded, they can > either assume it was ascii, and if they only do ascii-compatible things with > it, it will work, or they can encode/decode it and get the proper stuff > back, but only if they know the encoding, and if that's the case, why did > they not specify that in the first place? > >> >> For now I do want bytes, because that's how I changed statsmodels in >> the py3 conversion. >> >> This was just based on the fact that recfromtxt doesn't work with >> strings on python 3, so I switched to using bytes following the lead >> of numpy. > > > Well, that's really too bad -- it doesn't sound like you wanted bytes, it > sounds like you wanted something that didn't crash -- fair enough. But the > "proper" solution is for recfromtext to support text.... But also solution 2a) is fine for most of the code Often it doesn't really matter >>> dta_4 array([(1, 2, 3, b'hello', 'hello'), (5, 6, 7, b'\xc3\x95scarscar', '?scarscar'), (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', '?scar')], dtype=[('f0', '>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int) array([[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]) >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int) array([[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]) similar doing a for loop comparing to the uniques. bytes are fine and nobody has to tell me what encoding they are using. It doesn't work so well for pretty printing results, so using there latin-1 as you describe above might be a good solution if users don't decode to text/string Josef > >> I'm mainly worried about backwards compatibility, since we have been >> using this for 2 or 3 years. It would be easy to change in statsmodels >> when gen/recfromtxt is fixed, but I assume there is lots of other code >> using similar interpretation of S/bytes in numpy. > > > well, it does sound like enough folks are using 'S' to mean bytes -- too > bad, but what can we do now about that? > > I'd like a 's' for ascii-stings though. > > -Chris > > -- > > Christopher Barker, Ph.D. 
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From chris.barker at noaa.gov Fri Jan 17 16:55:56 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 13:55:56 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: On Fri, Jan 17, 2014 at 1:43 PM, wrote: > > 2) Either: > > a) open as a binary file and use bytes for anything that doesn't > parse > > as text -- this means that the user will need to do the conversion to > text > > themselves > > > > b) decode as latin-1: this would work well for ascii and _some_ > non-ascii > > text, and would be recoverable for ALL text. > > But also solution 2a) is fine for most of the code > Often it doesn't really matter > indeed -- I did list it as an option ;-) > >>> dta_4 > array([(1, 2, 3, b'hello', 'hello'), > (5, 6, 7, b'\xc3\x95scarscar', '?scarscar'), > (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', > '?scar')], > dtype=[('f0', ' 'S10'), ('f4', ' > >>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int) > array([[1, 0, 0], > [0, 0, 1], > [1, 0, 0], > [0, 1, 0]]) > >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int) > array([[1, 0, 0], > [0, 0, 1], > [1, 0, 0], > [0, 1, 0]]) > > similar doing a for loop comparing to the uniques. > bytes are fine and nobody has to tell me what encoding they are using. > and this same operation would work fine if that text was in (possibly improperly decoded) unicode objects. > It doesn't work so well for pretty printing results, so using there > latin-1 as you describe above might be a good solution if users don't > decode to text/string > exactly -- if you really need to work with the text, you need to know the encoding. Period. End of Story. If you don't know the encoding then there is still some stuff you can do with it, so you want something that: a) won't barf on any input b) will preserve the bytes if you need to pass them along, or compare them, or... Either bytes or latin-1 decoded strings will work for that. bytes are better, as it's more explicit that you may not have valid text here. unicode strings are better as you can do stringy things with them. Either way, you'll need to encode or decode to get full functionality. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 17 17:30:19 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 17 Jan 2014 14:30:19 -0800 Subject: [Numpy-discussion] A one-byte string dtype? Message-ID: Folks, I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. 
dont bring up genfromtext here): Would it be a good thing for numpy to have a one-byte--per-character string type? We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so that we can't change that now. The only difference may be that 'S' currently auto translates to a bytes object, resulting in things like: np.array(['some text',], dtype='S')[0] == 'some text' yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (and it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case. So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays. However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that are pure ascii, or at least some 1-byte-per-character encoding. So, in the spirit of having multiple numeric types that use different amounts of memory, and can hold different ranges of values, a one-byte-per character dtype would be nice: (note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally don't think that's worth it, but maybe that's because I'm an english speaker...) It could use the 's' (lower-case s) type identifier. For passing to/from python built-in objects, it would * Allow either Python bytes objects or Python unicode objects as input a) bytes objects would be passed through as-is b) unicode objects would be encoded as latin-1 [note: I'm not entirely sure that bytes objects should be allowed, but it would provide an nice efficiency in a fairly common case] * It would create python unicode text objects, decoded as latin-1. Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system. I've explained the latin-1 thing on other threads, but the short version is: - It will work perfectly for ascii text - It will work perfectly for latin-1 text (natch) - It will never give you an UnicodeEncodeError regardless of what arbitrary bytes you pass in. - It will preserve those arbitrary bytes through a encoding/decoding operation. (it still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one-byte per character...) So: Bad idea all around: shut up already! or Fine idea, but who's going to write the code? not me! or We really should do this. (of course, with the options of amending the above not-very-fleshed out proposal) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
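(A rough sketch of what that proposal would feel like in practice, approximated with today's 'S' dtype plus explicit latin-1 decode/encode at the boundaries -- this is only an emulation, not the proposed 's' dtype itself:

import numpy as np

stored = np.array([b'hello', b'\xc3\x95scar'], dtype='S6')   # one byte per character
text = np.char.decode(stored, 'latin-1')       # a 'U' array; never raises, whatever the bytes
text[0] == 'hello'                             # True -- ordinary str comparisons work again
np.all(np.char.encode(text, 'latin-1') == stored)   # True -- arbitrary bytes round-trip intact

The proposed dtype would simply do that decode/encode implicitly on item access while storing one byte per character throughout.)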
URL: From aldcroft at head.cfa.harvard.edu Fri Jan 17 17:40:47 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Fri, 17 Jan 2014 17:40:47 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: On Fri, Jan 17, 2014 at 4:43 PM, wrote: > On Fri, Jan 17, 2014 at 4:20 PM, Chris Barker > wrote: > > On Fri, Jan 17, 2014 at 12:36 PM, wrote: > >> > >> > ('S' ?) -- which is probably not what you want particularly if you > >> > specify > >> > an encoding. Though I can't figure out at the moment why the previous > >> > one > >> > failed -- where did the bytes object come from when the encoding was > >> > specified? > >> > >> Yes, it's a utf-8 file with nonascii. > >> > >> I don't know what I **should** want. > > > > > > well, you **should** want: > > > > The numbers parsed out for you (Other wise, why use recfromtxt), and the > > text as properly decoded unicode strings. > > > > Python does very well with unicode -- and you are MUCH happier if you do > the > > encoding/decoding as close to I/O as possible. recfromtxt is, in a way, > > decoding already, converting ascii representation of numbers to an > internal > > binary representation -- why not handle the text at the same time. > > > > There certainly are use cases for keeping the text as encoded bytes, but > I'd > > say those fall into the categories of: > > > > 1) Special case > > 2) You should know what you are doing. > > > > So having recfromtxt auto-determine that for you makes little sense. > > > > Note that if you don't know the file encoding, this is tricky. My > thoughts: > > > > 1) don't use the system default encoding!!! (see my other note on that!) > > > > 2) Either: > > a) open as a binary file and use bytes for anything that doesn't > parse > > as text -- this means that the user will need to do the conversion to > text > > themselves > > > > b) decode as latin-1: this would work well for ascii and _some_ > non-ascii > > text, and would be recoverable for ALL text. > > > > I prefer (b). The point here is that if the user gets bytes, then they > will > > either have to assume ascii, or need to hand-decode it, and if they just > > want assume ascii, they have a bytes object with limited text > functionality > > so will probably need to decode it anyway (unless they are just passing > it > > through) > > > > If the user gets unicode objects that are may not properly decoded, they > can > > either assume it was ascii, and if they only do ascii-compatible things > with > > it, it will work, or they can encode/decode it and get the proper stuff > > back, but only if they know the encoding, and if that's the case, why did > > they not specify that in the first place? > > > >> > >> For now I do want bytes, because that's how I changed statsmodels in > >> the py3 conversion. > >> > >> This was just based on the fact that recfromtxt doesn't work with > >> strings on python 3, so I switched to using bytes following the lead > >> of numpy. > > > > > > Well, that's really too bad -- it doesn't sound like you wanted bytes, it > > sounds like you wanted something that didn't crash -- fair enough. But > the > > "proper" solution is for recfromtext to support text.... 
> > But also solution 2a) is fine for most of the code > Often it doesn't really matter > > >>> dta_4 > array([(1, 2, 3, b'hello', 'hello'), > (5, 6, 7, b'\xc3\x95scarscar', '?scarscar'), > (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', > '?scar')], > dtype=[('f0', ' 'S10'), ('f4', ' > >>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int) > array([[1, 0, 0], > [0, 0, 1], > [1, 0, 0], > [0, 1, 0]]) > >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int) > array([[1, 0, 0], > [0, 0, 1], > [1, 0, 0], > [0, 1, 0]]) > > similar doing a for loop comparing to the uniques. > bytes are fine and nobody has to tell me what encoding they are using. > >From my perspective bytes are not fine, at least if you want to use normal string literals in Python 3: In [64]: dat Out[64]: array([(1, 2, 3, b'hello', 'hello'), (5, 6, 7, b'\xc3\x95scarscar', '?scarscar'), (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', '?scar')], dtype=[('f0', ' in () ----> 1 'The 3rd element of f3 is "{}"'.format(dat['f3'][3]) RuntimeError: maximum recursion depth exceeded while calling a Python object > > It doesn't work so well for pretty printing results, so using there > latin-1 as you describe above might be a good solution if users don't > decode to text/string > > Josef > > > > >> I'm mainly worried about backwards compatibility, since we have been > >> using this for 2 or 3 years. It would be easy to change in statsmodels > >> when gen/recfromtxt is fixed, but I assume there is lots of other code > >> using similar interpretation of S/bytes in numpy. > > > > > > well, it does sound like enough folks are using 'S' to mean bytes -- too > > bad, but what can we do now about that? > > > > I'd like a 's' for ascii-stings though. > > > > -Chris > > > > -- > > > > Christopher Barker, Ph.D. > > Oceanographer > > > > Emergency Response Division > > NOAA/NOS/OR&R (206) 526-6959 voice > > 7600 Sand Point Way NE (206) 526-6329 fax > > Seattle, WA 98115 (206) 526-6317 main reception > > > > Chris.Barker at noaa.gov > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Fri Jan 17 18:05:16 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Fri, 17 Jan 2014 18:05:16 -0500 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: On Fri, Jan 17, 2014 at 5:30 PM, Chris Barker wrote: > Folks, > > I've been blathering away on the related threads a lot -- sorry if it's > too much. It's gotten a bit tangled up, so I thought I'd start a new one to > address this one question (i.e. dont bring up genfromtext here): > > Would it be a good thing for numpy to have a one-byte--per-character > string type? > > We did have that with the 'S' type in py2, but the changes in py3 have > made it not quite the right thing. And it appears that enough people use > 'S' in py3 to mean 'bytes', so that we can't change that now. > > The only difference may be that 'S' currently auto translates to a bytes > object, resulting in things like: > > np.array(['some text',], dtype='S')[0] == 'some text' > > yielding False on Py3. 
And you can't do all the usual text stuff with the > resulting bytes object, either. (and it probably used the default encoding > to generate the bytes, so will barf on some inputs, though that may be > unavoidable.) So you need to decode the bytes that are given back, and now > that I think about it, I have no idea what encoding you'd need to use in > the general case. > > So the correct solution is (particularly on py3) to use the 'U' (unicode) > dtype for text in numpy arrays. > > However, the 'U' dtype is 4 bytes per character, and that may be "too big" > for some use-cases. And there is a lot of text in scientific data sets that > are pure ascii, or at least some 1-byte-per-character encoding. > > So, in the spirit of having multiple numeric types that use different > amounts of memory, and can hold different ranges of values, a one-byte-per > character dtype would be nice: > > (note, this opens the door for a 2-byte per (UCS-2) dtype too, I > personally don't think that's worth it, but maybe that's because I'm an > english speaker...) > > It could use the 's' (lower-case s) type identifier. > > For passing to/from python built-in objects, it would > > * Allow either Python bytes objects or Python unicode objects as input > a) bytes objects would be passed through as-is > b) unicode objects would be encoded as latin-1 > > [note: I'm not entirely sure that bytes objects should be allowed, but it > would provide an nice efficiency in a fairly common case] > > * It would create python unicode text objects, decoded as latin-1. > > Could we have a way to specify another encoding? I'm not sure how that > would fit into the dtype system. > > I've explained the latin-1 thing on other threads, but the short version > is: > > - It will work perfectly for ascii text > - It will work perfectly for latin-1 text (natch) > - It will never give you an UnicodeEncodeError regardless of what > arbitrary bytes you pass in. > - It will preserve those arbitrary bytes through a encoding/decoding > operation. > > (it still wouldn't allow you to store arbitrary unicode -- but that's the > limitation of one-byte per character...) > > So: > > Bad idea all around: shut up already! > > or > > Fine idea, but who's going to write the code? not me! > > or > > We really should do this. > As evident from what I said in the previous thread, YES, this should really be done! One important feature would be changing the dtype from 'S' to 's' without any memory copies, so that conversion would be very cheap. Maybe this would essentially come for free with something like astype('s', copy=False). - Tom > > (of course, with the options of amending the above not-very-fleshed out > proposal) > > -Chris > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
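(For scale, a sketch of the copy that a zero-copy 'S' -> 's' conversion would avoid; the array size is made up purely for illustration and assumes plain-ascii content:

import numpy as np

s = np.array([b'hello'] * 1000000, dtype='S5')
u = s.astype('U5')             # today: a full copy at 4 bytes per character
s.nbytes, u.nbytes             # (5000000, 20000000)
)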
URL: From josef.pktd at gmail.com Fri Jan 17 21:15:51 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 17 Jan 2014 21:15:51 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> Message-ID: It looks like both recfromtxt and loadtxt are flexible enough to handle string/bytes en/decoding, - with a bit of work and using enough information >>> dtype=[('f0', '>> data = numpy.recfromtxt(open('?scar_3.txt',"rb"), dtype=dtype, delimiter=',',converters={3:lambda x: x.decode('utf8')}) >>> data['f3'] == '?scar' array([False, False, False, True], dtype=bool) >>> data rec.array([(1, 2, 3, 'hello'), (5, 6, 7, '?scarscar'), (15, 2, 3, 'hello'), (20, 2, 3, '?scar')], dtype=[('f0', '>> data = numpy.loadtxt(open('?scar_3.txt',"rb"), dtype=dtype, delimiter=',',converters={3:lambda x: x.decode('utf8')}) >>> data array([(1, 2, 3, 'hello'), (5, 6, 7, '?scarscar'), (15, 2, 3, 'hello'), (20, 2, 3, '?scar')], dtype=[('f0', '>> Josef From pjrandew at sun.ac.za Sat Jan 18 05:40:28 2014 From: pjrandew at sun.ac.za (Randewijk, PJ, Dr ) Date: Sat, 18 Jan 2014 10:40:28 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <20140117124414.GA2253@gmail.com> <20140117134033.GB2253@gmail.com> <52D98217.7020000@googlemail.com> , Message-ID: Gestuur vanaf my Samsung S3 Mini -------- Original message -------- From: josef.pktd at gmail.com Date: 18/01/2014 04:16 (GMT+02:00) To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array It looks like both recfromtxt and loadtxt are flexible enough to handle string/bytes en/decoding, - with a bit of work and using enough information >>> dtype=[('f0', '>> data = numpy.recfromtxt(open('?scar_3.txt',"rb"), dtype=dtype, delimiter=',',converters={3:lambda x: x.decode('utf8')}) >>> data['f3'] == '?scar' array([False, False, False, True], dtype=bool) >>> data rec.array([(1, 2, 3, 'hello'), (5, 6, 7, '?scarscar'), (15, 2, 3, 'hello'), (20, 2, 3, '?scar')], dtype=[('f0', '>> data = numpy.loadtxt(open('?scar_3.txt',"rb"), dtype=dtype, delimiter=',',converters={3:lambda x: x.decode('utf8')}) >>> data array([(1, 2, 3, 'hello'), (5, 6, 7, '?scarscar'), (15, 2, 3, 'hello'), (20, 2, 3, '?scar')], dtype=[('f0', '>> Josef _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ________________________________ E-pos vrywaringsklousule Hierdie e-pos mag vertroulike inligting bevat en mag regtens geprivilegeerd wees en is slegs bedoel vir die persoon aan wie dit geadresseer is. Indien u nie die bedoelde ontvanger is nie, word u hiermee in kennis gestel dat u hierdie dokument geensins mag gebruik, versprei of kopieer nie. Stel ook asseblief die sender onmiddellik per telefoon in kennis en vee die e-pos uit. Die Universiteit aanvaar nie aanspreeklikheid vir enige skade, verlies of uitgawe wat voortspruit uit hierdie e-pos en/of die oopmaak van enige l?ers aangeheg by hierdie e-pos nie. E-mail disclaimer This e-mail may contain confidential information and may be legally privileged and is intended only for the person to whom it is addressed. 
If you are not the intended recipient, you are notified that you may not use, distribute or copy this document in any manner whatsoever. Kindly also notify the sender immediately by telephone, and delete the e-mail. The University does not accept liability for any damage, loss or expense arising from this e-mail and/or accessing any files attached to this e-mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenny.stone125 at gmail.com Sat Jan 18 13:48:33 2014 From: jenny.stone125 at gmail.com (jennifer stone) Date: Sun, 19 Jan 2014 00:18:33 +0530 Subject: [Numpy-discussion] (no subject) Message-ID: Hello, This is Jennifer Stupensky. I would like to contribute to NumPy this GSoC. What are the potential projects that can be taken up within the scope of GSoC? Thanks a lot in anticipation Regards Jennifer -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jan 18 14:07:32 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 18 Jan 2014 12:07:32 -0700 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: Hi Jennifer, On Sat, Jan 18, 2014 at 11:48 AM, jennifer stone wrote: > Hello, > This is Jennifer Stupensky. I would like to contribute to NumPy this GSoC. > What are the potential projects that can be taken up within the scope of > GSoC? Thanks a lot in anticipation > Regards > What are your interests and experience? If you use numpy, are there things you would like to fix, or enhancements you would like to see? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Mon Jan 20 05:11:15 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 20 Jan 2014 10:11:15 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: <20140120101113.GA2178@gmail.com> On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote: > Folks, > > I've been blathering away on the related threads a lot -- sorry if it's too > much. It's gotten a bit tangled up, so I thought I'd start a new one to > address this one question (i.e. dont bring up genfromtext here): > > Would it be a good thing for numpy to have a one-byte--per-character string > type? If you mean a string type that can only hold latin-1 characters then I think that this is a step backwards. If you mean a dtype that holds bytes in a known, specifiable encoding and automatically decodes them to unicode strings when you call .item() and has a friendly repr() then that may be a good idea. So for example you could have dtype='S:utf-8' which would store strings encoded as utf-8 e.g.: >>> text = array(['foo', 'bar'], dtype='S:utf-8') >>> text array(['foo', 'bar'], dtype='|S3:utf-8') >>> print(a) ['foo', 'bar'] >>> a[0] 'foo' >>> a.nbytes 6 > We did have that with the 'S' type in py2, but the changes in py3 have made > it not quite the right thing. And it appears that enough people use 'S' in > py3 to mean 'bytes', so that we can't change that now. It wasn't really the right thing before either. That's why Python 3 has changed all of this. > The only difference may be that 'S' currently auto translates to a bytes > object, resulting in things like: > > np.array(['some text',], dtype='S')[0] == 'some text' > > yielding False on Py3. And you can't do all the usual text stuff with the > resulting bytes object, either. 
(and it probably used the default encoding > to generate the bytes, so will barf on some inputs, though that may be > unavoidable.) So you need to decode the bytes that are given back, and now > that I think about it, I have no idea what encoding you'd need to use in > the general case. You should let the user specify the encoding or otherwise require them to use the 'U' dtype. > So the correct solution is (particularly on py3) to use the 'U' (unicode) > dtype for text in numpy arrays. Absolutely. Embrace the Python 3 text model. Once you understand the how, what and why of it you'll see that it really is a good thing! > However, the 'U' dtype is 4 bytes per character, and that may be "too big" > for some use-cases. And there is a lot of text in scientific data sets that > are pure ascii, or at least some 1-byte-per-character encoding. > > So, in the spirit of having multiple numeric types that use different > amounts of memory, and can hold different ranges of values, a one-byte-per > character dtype would be nice: > > (note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally > don't think that's worth it, but maybe that's because I'm an english > speaker...) You could just use a 2-byte encoding with the S dtype e.g. dtype='S:utf-16-le'. > It could use the 's' (lower-case s) type identifier. > > For passing to/from python built-in objects, it would > > * Allow either Python bytes objects or Python unicode objects as input > a) bytes objects would be passed through as-is > b) unicode objects would be encoded as latin-1 > > [note: I'm not entirely sure that bytes objects should be allowed, but it > would provide an nice efficiency in a fairly common case] I think it would be a bad idea to accept bytes here. There are good reasons that Python 3 creates a barrier between the two worlds of text and bytes. Allowing implicit mixing of bytes and text is a recipe for mojibake. The TypeErrors in Python 3 are used to guard against conceptual errors that lead to data corruption. Attempting to undermine that barrier in numpy would be a backward step. I apologise if this is misplaced but there seems to be an attitude that scientific programming isn't really affected by the issues that have lead to the Python 3 text model. I think that's ridiculous; data corruption is a problem in scientific programming just as it is anywhere else. > * It would create python unicode text objects, decoded as latin-1. Don't try to bless a particular encoding and stop trying to pretend that it's possible to write a sensible system where end users don't need to worry about and specify the encoding of their data. > Could we have a way to specify another encoding? I'm not sure how that > would fit into the dtype system. If the encoding cannot be specified then the whole idea is misguided. > I've explained the latin-1 thing on other threads, but the short version is: > > - It will work perfectly for ascii text > - It will work perfectly for latin-1 text (natch) > - It will never give you an UnicodeEncodeError regardless of what > arbitrary bytes you pass in. > - It will preserve those arbitrary bytes through a encoding/decoding > operation. So what happens if I do: >>> with open('myutf-8-file.txt', 'rb') as fin: ... text = numpy.fromfile(fin, dtype='s') >>> text[0] # Decodes as latin-1 leading to mojibake. I would propose that it's better to be able to do: >>> with open('myutf-8-file.txt', 'rb') as fin: ... 
text = numpy.fromfile(fin, dtype='s:utf-8') There's really no way to get around the fact that users need to specify the encoding of their text files. > (it still wouldn't allow you to store arbitrary unicode -- but that's the > limitation of one-byte per character...) You could if you use 'utf-8'. It would be one-byte-per-char for text that only contains ascii characters. However it would still support every character that the unicode consortium can dream up. The only possible advantage here is as a memory optimisation (potentially having a speed impact too although it could equally be a speed regression). Otherwise it just adds needless complexity to numpy and to the code that uses the new dtype as well as limiting its ability to handle unicode. How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype? Oscar From aldcroft at head.cfa.harvard.edu Mon Jan 20 10:00:55 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 20 Jan 2014 10:00:55 -0500 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: <20140120101113.GA2178@gmail.com> References: <20140120101113.GA2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin wrote: > On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote: > > Folks, > > > > I've been blathering away on the related threads a lot -- sorry if it's > too > > much. It's gotten a bit tangled up, so I thought I'd start a new one to > > address this one question (i.e. dont bring up genfromtext here): > > > > Would it be a good thing for numpy to have a one-byte--per-character > string > > type? > > If you mean a string type that can only hold latin-1 characters then I > think > that this is a step backwards. > > If you mean a dtype that holds bytes in a known, specifiable encoding and > automatically decodes them to unicode strings when you call .item() and > has a > friendly repr() then that may be a good idea. > > So for example you could have dtype='S:utf-8' which would store strings > encoded as utf-8 e.g.: > > >>> text = array(['foo', 'bar'], dtype='S:utf-8') > >>> text > array(['foo', 'bar'], dtype='|S3:utf-8') > >>> print(a) > ['foo', 'bar'] > >>> a[0] > 'foo' > >>> a.nbytes > 6 > > > We did have that with the 'S' type in py2, but the changes in py3 have > made > > it not quite the right thing. And it appears that enough people use 'S' > in > > py3 to mean 'bytes', so that we can't change that now. > > It wasn't really the right thing before either. That's why Python 3 has > changed all of this. > > > The only difference may be that 'S' currently auto translates to a bytes > > object, resulting in things like: > > > > np.array(['some text',], dtype='S')[0] == 'some text' > > > > yielding False on Py3. And you can't do all the usual text stuff with the > > resulting bytes object, either. (and it probably used the default > encoding > > to generate the bytes, so will barf on some inputs, though that may be > > unavoidable.) So you need to decode the bytes that are given back, and > now > > that I think about it, I have no idea what encoding you'd need to use in > > the general case. > > You should let the user specify the encoding or otherwise require them to > use > the 'U' dtype. 
> > > So the correct solution is (particularly on py3) to use the 'U' (unicode) > > dtype for text in numpy arrays. > > Absolutely. Embrace the Python 3 text model. Once you understand the how, > what > and why of it you'll see that it really is a good thing! > > > However, the 'U' dtype is 4 bytes per character, and that may be "too > big" > > for some use-cases. And there is a lot of text in scientific data sets > that > > are pure ascii, or at least some 1-byte-per-character encoding. > > > > So, in the spirit of having multiple numeric types that use different > > amounts of memory, and can hold different ranges of values, a > one-byte-per > > character dtype would be nice: > > > > (note, this opens the door for a 2-byte per (UCS-2) dtype too, I > personally > > don't think that's worth it, but maybe that's because I'm an english > > speaker...) > > You could just use a 2-byte encoding with the S dtype e.g. > dtype='S:utf-16-le'. > > > It could use the 's' (lower-case s) type identifier. > > > > For passing to/from python built-in objects, it would > > > > * Allow either Python bytes objects or Python unicode objects as input > > a) bytes objects would be passed through as-is > > b) unicode objects would be encoded as latin-1 > > > > [note: I'm not entirely sure that bytes objects should be allowed, but it > > would provide an nice efficiency in a fairly common case] > > I think it would be a bad idea to accept bytes here. There are good reasons > that Python 3 creates a barrier between the two worlds of text and bytes. > Allowing implicit mixing of bytes and text is a recipe for mojibake. The > TypeErrors in Python 3 are used to guard against conceptual errors that > lead > to data corruption. Attempting to undermine that barrier in numpy would be > a > backward step. > > I apologise if this is misplaced but there seems to be an attitude that > scientific programming isn't really affected by the issues that have lead > to > the Python 3 text model. I think that's ridiculous; data corruption is a > problem in scientific programming just as it is anywhere else. > > > * It would create python unicode text objects, decoded as latin-1. > > Don't try to bless a particular encoding and stop trying to pretend that > it's > possible to write a sensible system where end users don't need to worry > about > and specify the encoding of their data. > > > Could we have a way to specify another encoding? I'm not sure how that > > would fit into the dtype system. > > If the encoding cannot be specified then the whole idea is misguided. > > > I've explained the latin-1 thing on other threads, but the short version > is: > > > > - It will work perfectly for ascii text > > - It will work perfectly for latin-1 text (natch) > > - It will never give you an UnicodeEncodeError regardless of what > > arbitrary bytes you pass in. > > - It will preserve those arbitrary bytes through a encoding/decoding > > operation. > > So what happens if I do: > > >>> with open('myutf-8-file.txt', 'rb') as fin: > ... text = numpy.fromfile(fin, dtype='s') > >>> text[0] # Decodes as latin-1 leading to mojibake. > > I would propose that it's better to be able to do: > > >>> with open('myutf-8-file.txt', 'rb') as fin: > ... text = numpy.fromfile(fin, dtype='s:utf-8') > > There's really no way to get around the fact that users need to specify the > encoding of their text files. > > > (it still wouldn't allow you to store arbitrary unicode -- but that's the > > limitation of one-byte per character...) 
> > You could if you use 'utf-8'. It would be one-byte-per-char for text that > only > contains ascii characters. However it would still support every character > that > the unicode consortium can dream up. > The only possible advantage here is as a memory optimisation (potentially > having a speed impact too although it could equally be a speed regression). > Otherwise it just adds needless complexity to numpy and to the code that > uses > the new dtype as well as limiting its ability to handle unicode. > How significant are the performance issues? Does anyone really use numpy > for > this kind of text handling? If you really are operating on gigantic text > arrays of ascii characters then is it so bad to just use the bytes dtype > and > handle decoding/encoding at the boundaries? If you're not operating on > gigantic text arrays is there really a noticeable problem just using the > 'U' > dtype? > I use numpy for giga-row arrays of short text strings, so memory and performance issues are real. As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess. >From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing. - Tom > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Mon Jan 20 10:40:42 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 20 Jan 2014 15:40:42 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> Message-ID: <20140120154039.GD2178@gmail.com> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote: > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin > wrote: > > How significant are the performance issues? Does anyone really use numpy > > for > > this kind of text handling? If you really are operating on gigantic text > > arrays of ascii characters then is it so bad to just use the bytes dtype > > and > > handle decoding/encoding at the boundaries? If you're not operating on > > gigantic text arrays is there really a noticeable problem just using the > > 'U' > > dtype? > > > > I use numpy for giga-row arrays of short text strings, so memory and > performance issues are real. > > As discussed in the previous parent thread, using the bytes dtype is really > a problem because users of a text array want to do things like filtering > (`match_rows = text_array == 'match'`), printing, or other manipulations in > a natural way without having to continually use bytestring literals or > `.decode('ascii')` everywhere. I tried converting a few packages while > leaving the arrays as bytestrings and it just ended up as a very big mess. 
> > From my perspective the goal here is to provide a pragmatic way to allow > numpy-based applications and end users to use python 3. Something like > this proposal seems to be the right direction, maybe not pure and perfect > but a sensible step to get us there given the reality of scientific > computing. I don't really see how writing b'match' instead of 'match' is that big a deal. And why are you needing to write .decode('ascii') everywhere? If you really do just want to work with bytes in your own known encoding then why not just read and write in binary mode? I apologise if I'm wrong but I suspect that much of the difficulty in getting the bytes/unicode separation right is down to the fact that a lot of the code you're using (or attempting to support) hasn't yet been ported to a clean text model. When I started using Python 3 it took me quite a few failed attempts at understanding the text model before I got to the point where I understood how it is supposed to be used. The problem was that I had been conflating text and bytes in many places, and that's hard to disentangle. Having fixed most of those problems I now understand why it is such an improvement. In any case I don't see anything wrong with a more efficient dtype for representing text if the user can specify the encoding. The problem is that numpy arrays expose their underlying memory buffer. Allowing them to interact directly with text strings on the one side and binary files on the other breaches Python 3's very good text model unless the user can specify the encoding that is to be used. Or at least if there is to be a blessed encoding then make it unicode-capable utf-8 instead of legacy ascii/latin-1. Oscar From aldcroft at head.cfa.harvard.edu Mon Jan 20 12:12:06 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 20 Jan 2014 12:12:06 -0500 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: <20140120154039.GD2178@gmail.com> References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin wrote: > On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote: > > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin > > wrote: > > > How significant are the performance issues? Does anyone really use > numpy > > > for > > > this kind of text handling? If you really are operating on gigantic > text > > > arrays of ascii characters then is it so bad to just use the bytes > dtype > > > and > > > handle decoding/encoding at the boundaries? If you're not operating on > > > gigantic text arrays is there really a noticeable problem just using > the > > > 'U' > > > dtype? > > > > > > > I use numpy for giga-row arrays of short text strings, so memory and > > performance issues are real. > > > > As discussed in the previous parent thread, using the bytes dtype is > really > > a problem because users of a text array want to do things like filtering > > (`match_rows = text_array == 'match'`), printing, or other manipulations > in > > a natural way without having to continually use bytestring literals or > > `.decode('ascii')` everywhere. I tried converting a few packages while > > leaving the arrays as bytestrings and it just ended up as a very big > mess. > > > > From my perspective the goal here is to provide a pragmatic way to allow > > numpy-based applications and end users to use python 3. 
Something like > > this proposal seems to be the right direction, maybe not pure and perfect > > but a sensible step to get us there given the reality of scientific > > computing. > > I don't really see how writing b'match' instead of 'match' is that big a > deal. > It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3. Yes, you can backfix all the python 2 code and use bytestring literals everywhere, but that is very painful and ugly. More importantly it's very fiddly because *sometimes* you'll need to use bytestring literals, and *sometimes* not, depending on the exact dataset you've been handed. That's basically a non-starter. As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place update of 'S' to 's'. > And why are you needing to write .decode('ascii') everywhere? >>> print("The first value is {}".format(bytestring_array[0])) On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'". > If you really > do just want to work with bytes in your own known encoding then why not > just > read and write in binary mode? > > I apologise if I'm wrong but I suspect that much of the difficulty in > getting > the bytes/unicode separation right is down to the fact that a lot of the > code > you're using (or attempting to support) hasn't yet been ported to a clean > text > model. When I started using Python 3 it took me quite a few failed attempts > at understanding the text model before I got to the point where I > understood > how it is supposed to be used. The problem was that I had been conflating > text > and bytes in many places, and that's hard to disentangle. Having fixed > most of > those problems I now understand why it is such an improvement. > > In any case I don't see anything wrong with a more efficient dtype for > representing text if the user can specify the encoding. The problem is that > numpy arrays expose their underlying memory buffer. Allowing them to > interact > directly with text strings on the one side and binary files on the other > breaches Python 3's very good text model unless the user can specify the > encoding that is to be used. Or at least if there is to be a blessed > encoding > then make it unicode-capable utf-8 instead of legacy ascii/latin-1. > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jan 20 12:17:21 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 20 Jan 2014 10:17:21 -0700 Subject: [Numpy-discussion] A one-byte string dtype? 
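So on Python 3 the same line ends up needing an explicit decode, either per element or once for the whole column -- a sketch, with `bytestring_array` being the hypothetical array from the example above, assumed to hold plain ascii:

>>> print("The first value is {}".format(bytestring_array[0].decode('ascii')))
The first value is string_value
>>> text_array = np.char.decode(bytestring_array, 'ascii')   # or pay the 4x memory cost once up front

which is exactly the kind of boilerplate a one-byte text dtype would make unnecessary.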
In-Reply-To: References: <20140120101113.GA2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 8:00 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin < > oscar.j.benjamin at gmail.com> wrote: > >> On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote: >> > Folks, >> > >> > I've been blathering away on the related threads a lot -- sorry if it's >> too >> > much. It's gotten a bit tangled up, so I thought I'd start a new one to >> > address this one question (i.e. dont bring up genfromtext here): >> > >> > Would it be a good thing for numpy to have a one-byte--per-character >> string >> > type? >> >> If you mean a string type that can only hold latin-1 characters then I >> think >> that this is a step backwards. >> >> If you mean a dtype that holds bytes in a known, specifiable encoding and >> automatically decodes them to unicode strings when you call .item() and >> has a >> friendly repr() then that may be a good idea. >> >> So for example you could have dtype='S:utf-8' which would store strings >> encoded as utf-8 e.g.: >> >> >>> text = array(['foo', 'bar'], dtype='S:utf-8') >> >>> text >> array(['foo', 'bar'], dtype='|S3:utf-8') >> >>> print(a) >> ['foo', 'bar'] >> >>> a[0] >> 'foo' >> >>> a.nbytes >> 6 >> >> > We did have that with the 'S' type in py2, but the changes in py3 have >> made >> > it not quite the right thing. And it appears that enough people use 'S' >> in >> > py3 to mean 'bytes', so that we can't change that now. >> >> It wasn't really the right thing before either. That's why Python 3 has >> changed all of this. >> >> > The only difference may be that 'S' currently auto translates to a bytes >> > object, resulting in things like: >> > >> > np.array(['some text',], dtype='S')[0] == 'some text' >> > >> > yielding False on Py3. And you can't do all the usual text stuff with >> the >> > resulting bytes object, either. (and it probably used the default >> encoding >> > to generate the bytes, so will barf on some inputs, though that may be >> > unavoidable.) So you need to decode the bytes that are given back, and >> now >> > that I think about it, I have no idea what encoding you'd need to use in >> > the general case. >> >> You should let the user specify the encoding or otherwise require them to >> use >> the 'U' dtype. >> >> > So the correct solution is (particularly on py3) to use the 'U' >> (unicode) >> > dtype for text in numpy arrays. >> >> Absolutely. Embrace the Python 3 text model. Once you understand the how, >> what >> and why of it you'll see that it really is a good thing! >> >> > However, the 'U' dtype is 4 bytes per character, and that may be "too >> big" >> > for some use-cases. And there is a lot of text in scientific data sets >> that >> > are pure ascii, or at least some 1-byte-per-character encoding. >> > >> > So, in the spirit of having multiple numeric types that use different >> > amounts of memory, and can hold different ranges of values, a >> one-byte-per >> > character dtype would be nice: >> > >> > (note, this opens the door for a 2-byte per (UCS-2) dtype too, I >> personally >> > don't think that's worth it, but maybe that's because I'm an english >> > speaker...) >> >> You could just use a 2-byte encoding with the S dtype e.g. >> dtype='S:utf-16-le'. >> >> > It could use the 's' (lower-case s) type identifier. 
>> > >> > For passing to/from python built-in objects, it would >> > >> > * Allow either Python bytes objects or Python unicode objects as input >> > a) bytes objects would be passed through as-is >> > b) unicode objects would be encoded as latin-1 >> > >> > [note: I'm not entirely sure that bytes objects should be allowed, but >> it >> > would provide an nice efficiency in a fairly common case] >> >> I think it would be a bad idea to accept bytes here. There are good >> reasons >> that Python 3 creates a barrier between the two worlds of text and bytes. >> Allowing implicit mixing of bytes and text is a recipe for mojibake. The >> TypeErrors in Python 3 are used to guard against conceptual errors that >> lead >> to data corruption. Attempting to undermine that barrier in numpy would >> be a >> backward step. >> >> I apologise if this is misplaced but there seems to be an attitude that >> scientific programming isn't really affected by the issues that have lead >> to >> the Python 3 text model. I think that's ridiculous; data corruption is a >> problem in scientific programming just as it is anywhere else. >> >> > * It would create python unicode text objects, decoded as latin-1. >> >> Don't try to bless a particular encoding and stop trying to pretend that >> it's >> possible to write a sensible system where end users don't need to worry >> about >> and specify the encoding of their data. >> >> > Could we have a way to specify another encoding? I'm not sure how that >> > would fit into the dtype system. >> >> If the encoding cannot be specified then the whole idea is misguided. >> >> > I've explained the latin-1 thing on other threads, but the short >> version is: >> > >> > - It will work perfectly for ascii text >> > - It will work perfectly for latin-1 text (natch) >> > - It will never give you an UnicodeEncodeError regardless of what >> > arbitrary bytes you pass in. >> > - It will preserve those arbitrary bytes through a encoding/decoding >> > operation. >> >> So what happens if I do: >> >> >>> with open('myutf-8-file.txt', 'rb') as fin: >> ... text = numpy.fromfile(fin, dtype='s') >> >>> text[0] # Decodes as latin-1 leading to mojibake. >> >> I would propose that it's better to be able to do: >> >> >>> with open('myutf-8-file.txt', 'rb') as fin: >> ... text = numpy.fromfile(fin, dtype='s:utf-8') >> >> There's really no way to get around the fact that users need to specify >> the >> encoding of their text files. >> >> > (it still wouldn't allow you to store arbitrary unicode -- but that's >> the >> > limitation of one-byte per character...) >> >> You could if you use 'utf-8'. It would be one-byte-per-char for text that >> only >> contains ascii characters. However it would still support every character >> that >> the unicode consortium can dream up. > > >> The only possible advantage here is as a memory optimisation (potentially >> having a speed impact too although it could equally be a speed >> regression). >> Otherwise it just adds needless complexity to numpy and to the code that >> uses >> the new dtype as well as limiting its ability to handle unicode. > > >> How significant are the performance issues? Does anyone really use numpy >> for >> this kind of text handling? If you really are operating on gigantic text >> arrays of ascii characters then is it so bad to just use the bytes dtype >> and >> handle decoding/encoding at the boundaries? If you're not operating on >> gigantic text arrays is there really a noticeable problem just using the >> 'U' >> dtype? 
>> > > I use numpy for giga-row arrays of short text strings, so memory and > performance issues are real. > > As discussed in the previous parent thread, using the bytes dtype is > really a problem because users of a text array want to do things like > filtering (`match_rows = text_array == 'match'`), printing, or other > manipulations in a natural way without having to continually use bytestring > literals or `.decode('ascii')` everywhere. I tried converting a few > packages while leaving the arrays as bytestrings and it just ended up as a > very big mess. > > From my perspective the goal here is to provide a pragmatic way to allow > numpy-based applications and end users to use python 3. Something like > this proposal seems to be the right direction, maybe not pure and perfect > but a sensible step to get us there given the reality of scientific > computing. > I think that is right. Not having an effective way to handle these common scientific data sets will block acceptance of Python 3. But we do need to figure out the best way to add this functionality. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jan 20 12:21:27 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 20 Jan 2014 10:21:27 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin < > oscar.j.benjamin at gmail.com> wrote: > >> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote: >> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin >> > wrote: >> > > How significant are the performance issues? Does anyone really use >> numpy >> > > for >> > > this kind of text handling? If you really are operating on gigantic >> text >> > > arrays of ascii characters then is it so bad to just use the bytes >> dtype >> > > and >> > > handle decoding/encoding at the boundaries? If you're not operating on >> > > gigantic text arrays is there really a noticeable problem just using >> the >> > > 'U' >> > > dtype? >> > > >> > >> > I use numpy for giga-row arrays of short text strings, so memory and >> > performance issues are real. >> > >> > As discussed in the previous parent thread, using the bytes dtype is >> really >> > a problem because users of a text array want to do things like filtering >> > (`match_rows = text_array == 'match'`), printing, or other >> manipulations in >> > a natural way without having to continually use bytestring literals or >> > `.decode('ascii')` everywhere. I tried converting a few packages while >> > leaving the arrays as bytestrings and it just ended up as a very big >> mess. >> > >> > From my perspective the goal here is to provide a pragmatic way to allow >> > numpy-based applications and end users to use python 3. Something like >> > this proposal seems to be the right direction, maybe not pure and >> perfect >> > but a sensible step to get us there given the reality of scientific >> > computing. >> >> I don't really see how writing b'match' instead of 'match' is that big a >> deal. >> > > It's a big deal because all your existing python 2 code suddenly breaks on > python 3, even after running 2to3. Yes, you can backfix all the python 2 > code and use bytestring literals everywhere, but that is very painful and > ugly. 
More importantly it's very fiddly because *sometimes* you'll need to > use bytestring literals, and *sometimes* not, depending on the exact > dataset you've been handed. That's basically a non-starter. > > As you say below, the only solution is a proper separation of > bytes/unicode where everything internally is unicode. The problem is that > the existing 4-byte unicode in numpy is a big performance / memory hit. > It's even trickier because libraries will happily deliver a numpy > structured array with an 'S'-dtype field (from a binary dataset on disk), > and it's a pain to then convert to 'U' since you need to remake the entire > structured array. With a one-byte unicode the goal would be an in-place > update of 'S' to 's'. > > >> And why are you needing to write .decode('ascii') everywhere? > > > >>> print("The first value is {}".format(bytestring_array[0])) > > On Python 2 this gives "The first value is string_value", while on Python > 3 this gives "The first value is b'string_value'". > As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here, it's that python itself needs to provide some help. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Mon Jan 20 13:40:32 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 20 Jan 2014 18:40:32 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Jan 20, 2014 5:21 PM, "Charles R Harris" wrote: > On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: >> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin < oscar.j.benjamin at gmail.com> wrote: >>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote: >>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin >>> >>> And why are you needing to write .decode('ascii') everywhere? >> >> >>> print("The first value is {}".format(bytestring_array[0])) >> >> On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'". > > As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here, it's that python itself needs to provide some help. If you think that anything in core Python will change so that you can mix text and bytes as above then I think you are very much mistaken. If you're referring to PEP 460/461 then you have misunderstood the purpose of those PEPs. The authors and reviewers will carefully ensure that nothing changes to make the above work the way that it did in 2.x. Oscar -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Mon Jan 20 14:37:02 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Mon, 20 Jan 2014 11:37:02 -0800 Subject: [Numpy-discussion] A one-byte string dtype? (Charles R Harris) Message-ID: On Mon, Jan 20, 2014 at 9:11 AM, wrote: > I think that is right. Not having an effective way to handle these common > scientific data sets will block acceptance of Python 3. But we do need to > figure out the best way to add this functionality. 
> > Chuck > Sounds like it might be time for some formal data collection, e.g., a wiki-poll of users' use-cases. (I know this wouldn't be exhaustive, but at least it will provide guidance and a "checklist" of situations we should be sure our solution covers.) DG -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Mon Jan 20 15:13:08 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 20 Jan 2014 15:13:08 -0500 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas wrote: > > > > On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin > wrote: >> >> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote: >> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin >> > wrote: >> > > How significant are the performance issues? Does anyone really use >> > > numpy >> > > for >> > > this kind of text handling? If you really are operating on gigantic >> > > text >> > > arrays of ascii characters then is it so bad to just use the bytes >> > > dtype >> > > and >> > > handle decoding/encoding at the boundaries? If you're not operating on >> > > gigantic text arrays is there really a noticeable problem just using >> > > the >> > > 'U' >> > > dtype? >> > > >> > >> > I use numpy for giga-row arrays of short text strings, so memory and >> > performance issues are real. >> > >> > As discussed in the previous parent thread, using the bytes dtype is >> > really >> > a problem because users of a text array want to do things like filtering >> > (`match_rows = text_array == 'match'`), printing, or other manipulations >> > in >> > a natural way without having to continually use bytestring literals or >> > `.decode('ascii')` everywhere. I tried converting a few packages while >> > leaving the arrays as bytestrings and it just ended up as a very big >> > mess. >> > >> > From my perspective the goal here is to provide a pragmatic way to allow >> > numpy-based applications and end users to use python 3. Something like >> > this proposal seems to be the right direction, maybe not pure and >> > perfect >> > but a sensible step to get us there given the reality of scientific >> > computing. >> >> I don't really see how writing b'match' instead of 'match' is that big a >> deal. > > > It's a big deal because all your existing python 2 code suddenly breaks on > python 3, even after running 2to3. Yes, you can backfix all the python 2 > code and use bytestring literals everywhere, but that is very painful and > ugly. More importantly it's very fiddly because *sometimes* you'll need to > use bytestring literals, and *sometimes* not, depending on the exact dataset > you've been handed. That's basically a non-starter. > > As you say below, the only solution is a proper separation of bytes/unicode > where everything internally is unicode. The problem is that the existing > 4-byte unicode in numpy is a big performance / memory hit. It's even > trickier because libraries will happily deliver a numpy structured array > with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to > then convert to 'U' since you need to remake the entire structured array. > With a one-byte unicode the goal would be an in-place update of 'S' to 's'. > >> >> And why are you needing to write .decode('ascii') everywhere? 
> > >>>> print("The first value is {}".format(bytestring_array[0])) > > On Python 2 this gives "The first value is string_value", while on Python 3 > this gives "The first value is b'string_value'". Unfortunately (?) setprintoptions and set_string_function don't work with numpy scalars AFAICS. If it did then it would be possible to override the string representation. It works for arrays. I didn't find the right key for numpy.bytes_ on python 3.3 so now my interpreter can only print bytes np.set_printoptions(formatter={'all':lambda x: x.decode('ascii',errors="ignore") }) Josef > >> >> If you really >> do just want to work with bytes in your own known encoding then why not >> just >> read and write in binary mode? >> >> I apologise if I'm wrong but I suspect that much of the difficulty in >> getting >> the bytes/unicode separation right is down to the fact that a lot of the >> code >> you're using (or attempting to support) hasn't yet been ported to a clean >> text >> model. When I started using Python 3 it took me quite a few failed >> attempts >> at understanding the text model before I got to the point where I >> understood >> how it is supposed to be used. The problem was that I had been conflating >> text >> and bytes in many places, and that's hard to disentangle. Having fixed >> most of >> those problems I now understand why it is such an improvement. >> >> In any case I don't see anything wrong with a more efficient dtype for >> representing text if the user can specify the encoding. The problem is >> that >> numpy arrays expose their underlying memory buffer. Allowing them to >> interact >> directly with text strings on the one side and binary files on the other >> breaches Python 3's very good text model unless the user can specify the >> encoding that is to be used. Or at least if there is to be a blessed >> encoding >> then make it unicode-capable utf-8 instead of legacy ascii/latin-1. >> >> >> Oscar >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From charlesr.harris at gmail.com Mon Jan 20 15:34:56 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 20 Jan 2014 13:34:56 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 11:40 AM, Oscar Benjamin wrote: > > On Jan 20, 2014 5:21 PM, "Charles R Harris" > wrote: > > On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas < > aldcroft at head.cfa.harvard.edu> wrote: > >> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin < > oscar.j.benjamin at gmail.com> wrote: > >>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote: > >>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin > >>> > >>> And why are you needing to write .decode('ascii') everywhere? > >> > >> >>> print("The first value is {}".format(bytestring_array[0])) > >> > >> On Python 2 this gives "The first value is string_value", while on > Python 3 this gives "The first value is b'string_value'". > > > > As Nathaniel has mentioned, this is a known problem with Python 3 and > the developers are trying to come up with a solution. Python 3.4 solves > some existing problems, but this one remains. 
It's not just numpy here, > it's that python itself needs to provide some help. > > If you think that anything in core Python will change so that you can mix > text and bytes as above then I think you are very much mistaken. If you're > referring to PEP 460/461 then you have misunderstood the purpose of those > PEPs. The authors and reviewers will carefully ensure that nothing changes > to make the above work the way that it did in 2.x. > I think we may want something like PEP 393. The S datatype may be the wrong place to look, we might want a modification of U instead so as to transparently get the benefit of python strings. Chuck > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Mon Jan 20 16:27:48 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 20 Jan 2014 21:27:48 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Jan 20, 2014 8:35 PM, "Charles R Harris" wrote: > > I think we may want something like PEP 393. The S datatype may be the wrong place to look, we might want a modification of U instead so as to transparently get the benefit of python strings. The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays for two reasons: str is immutable and opaque. Since str is immutable the maximum code point in the string can be determined once when the string is created before anything else can get a pointer to the string buffer. Since it is opaque no one can rightly expect it to expose a particular binary format so it is free to choose without compromising any expected semantics. If someone can call buffer on an array then the FSR is a semantic change. If a numpy 'U' array used the FSR and consisted only of ASCII characters then it would have a one byte per char buffer. What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views and vice versa. I don't think that this can be done transparently since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface. Oscar -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jan 20 17:28:09 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 20 Jan 2014 15:28:09 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin wrote: > > On Jan 20, 2014 8:35 PM, "Charles R Harris" > wrote: > > > > I think we may want something like PEP 393. The S datatype may be the > wrong place to look, we might want a modification of U instead so as to > transparently get the benefit of python strings. > > The approach taken in PEP 393 (the FSR) makes more sense for str than it > does for numpy arrays for two reasons: str is immutable and opaque. 
> > Since str is immutable the maximum code point in the string can be > determined once when the string is created before anything else can get a > pointer to the string buffer. > > Since it is opaque no one can rightly expect it to expose a particular > binary format so it is free to choose without compromising any expected > semantics. > > If someone can call buffer on an array then the FSR is a semantic change. > > If a numpy 'U' array used the FSR and consisted only of ASCII characters > then it would have a one byte per char buffer. What then happens if you put > a higher code point in? The buffer needs to be resized and the data copied > over. But then what happens to any buffer objects or array views? They > would be pointing at the old buffer from before the resize. Subsequent > modifications to the resized array would not show up in other views and > vice versa. > > I don't think that this can be done transparently since users of a numpy > array need to know about the binary representation. That's why I suggest a > dtype that has an encoding. Only in that way can it consistently have both > a binary and a text interface. > I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Jan 20 17:35:12 2014 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 20 Jan 2014 22:35:12 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris wrote: > > > > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin > wrote: >> >> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" >> wrote: >> > >> > I think we may want something like PEP 393. The S datatype may be the >> > wrong place to look, we might want a modification of U instead so as to >> > transparently get the benefit of python strings. >> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it >> does for numpy arrays for two reasons: str is immutable and opaque. >> >> Since str is immutable the maximum code point in the string can be >> determined once when the string is created before anything else can get a >> pointer to the string buffer. >> >> Since it is opaque no one can rightly expect it to expose a particular >> binary format so it is free to choose without compromising any expected >> semantics. >> >> If someone can call buffer on an array then the FSR is a semantic change. >> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters >> then it would have a one byte per char buffer. What then happens if you put >> a higher code point in? The buffer needs to be resized and the data copied >> over. But then what happens to any buffer objects or array views? They would >> be pointing at the old buffer from before the resize. Subsequent >> modifications to the resized array would not show up in other views and vice >> versa. 
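The shared-buffer problem described above can be seen directly with the
current 'U' dtype; a small illustration:

import numpy as np

# Views share the array's buffer; a representation that reallocated the
# buffer when a higher code point appeared would leave views like v behind.
a = np.array(['abc', 'def'], dtype='U3')
v = a.view(np.uint32).reshape(2, 3)   # the raw UCS-4 code points, same buffer
a[0] = 'xyz'
print(v[0])                           # [120 121 122] -- the view sees the change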
>> >> I don't think that this can be done transparently since users of a numpy >> array need to know about the binary representation. That's why I suggest a >> dtype that has an encoding. Only in that way can it consistently have both a >> binary and a text interface. > > > I didn't say we should change the S type, but that we should have something, > say 's', that appeared to python as a string. I think if we want transparent > string interoperability with python together with a compressed > representation, and I think we need both, we are going to have to deal with > the difficulties of utf-8. That means raising errors if the string doesn't > fit in the allotted size, etc. Mind, this is a workaround for the mass of > ascii data that is already out there, not a substitute for 'U'. If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter... Though, IMO any new dtype here would need a cleanup of the dtype code first so that it doesn't require yet more massive special cases all over umath.so. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From charlesr.harris at gmail.com Mon Jan 20 17:58:26 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 20 Jan 2014 15:58:26 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: > On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris > wrote: > > > > > > > > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin < > oscar.j.benjamin at gmail.com> > > wrote: > >> > >> > >> On Jan 20, 2014 8:35 PM, "Charles R Harris" > >> wrote: > >> > > >> > I think we may want something like PEP 393. The S datatype may be the > >> > wrong place to look, we might want a modification of U instead so as > to > >> > transparently get the benefit of python strings. > >> > >> The approach taken in PEP 393 (the FSR) makes more sense for str than it > >> does for numpy arrays for two reasons: str is immutable and opaque. > >> > >> Since str is immutable the maximum code point in the string can be > >> determined once when the string is created before anything else can get > a > >> pointer to the string buffer. > >> > >> Since it is opaque no one can rightly expect it to expose a particular > >> binary format so it is free to choose without compromising any expected > >> semantics. > >> > >> If someone can call buffer on an array then the FSR is a semantic > change. > >> > >> If a numpy 'U' array used the FSR and consisted only of ASCII characters > >> then it would have a one byte per char buffer. What then happens if you > put > >> a higher code point in? The buffer needs to be resized and the data > copied > >> over. But then what happens to any buffer objects or array views? They > would > >> be pointing at the old buffer from before the resize. 
Subsequent > >> modifications to the resized array would not show up in other views and > vice > >> versa. > >> > >> I don't think that this can be done transparently since users of a numpy > >> array need to know about the binary representation. That's why I > suggest a > >> dtype that has an encoding. Only in that way can it consistently have > both a > >> binary and a text interface. > > > > > > I didn't say we should change the S type, but that we should have > something, > > say 's', that appeared to python as a string. I think if we want > transparent > > string interoperability with python together with a compressed > > representation, and I think we need both, we are going to have to deal > with > > the difficulties of utf-8. That means raising errors if the string > doesn't > > fit in the allotted size, etc. Mind, this is a workaround for the mass of > > ascii data that is already out there, not a substitute for 'U'. > > If we're going to be taking that much trouble, I'd suggest going ahead > and adding a variable-length string type (where the array itself > contains a pointer to a lookaside buffer, maybe with an optimization > for stashing short strings directly). The fixed-length requirement is > pretty onerous for lots of applications (e.g., pandas always uses > dtype="O" for strings -- and that might be a good workaround for some > people in this thread for now). The use of a lookaside buffer would > also make it practical to resize the buffer when the maximum code > point changed, for that matter... > > Though, IMO any new dtype here would need a cleanup of the dtype code > first so that it doesn't require yet more massive special cases all > over umath.so. > Worth thinking about. As another alternative, what is the minimum we need to make a restricted encoding, say latin-1, appear transparently as a unicode string to python? I know the python folks don't like this much, but I suspect something along that line will eventually be required for the http folks. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jan 20 18:12:20 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 20 Jan 2014 16:12:20 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris wrote: > > > > On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: > >> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris >> wrote: >> > >> > >> > >> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin < >> oscar.j.benjamin at gmail.com> >> > wrote: >> >> >> >> >> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" > > >> >> wrote: >> >> > >> >> > I think we may want something like PEP 393. The S datatype may be the >> >> > wrong place to look, we might want a modification of U instead so as >> to >> >> > transparently get the benefit of python strings. >> >> >> >> The approach taken in PEP 393 (the FSR) makes more sense for str than >> it >> >> does for numpy arrays for two reasons: str is immutable and opaque. >> >> >> >> Since str is immutable the maximum code point in the string can be >> >> determined once when the string is created before anything else can >> get a >> >> pointer to the string buffer. >> >> >> >> Since it is opaque no one can rightly expect it to expose a particular >> >> binary format so it is free to choose without compromising any expected >> >> semantics. 
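The dtype="O" workaround mentioned above is roughly this (a sketch; each
element is an ordinary Python str):

import numpy as np

# Sketch of the dtype="O" workaround: each element is a real Python str,
# so Python 3 text semantics just work, at the cost of a pointer per
# element plus the per-string object overhead.
names = np.array(['alpha', 'beta', 'gamma'], dtype=object)
print(names == 'beta')      # [False  True False]
print(names[0].upper())     # normal str methods on the elements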
>> >> >> >> If someone can call buffer on an array then the FSR is a semantic >> change. >> >> >> >> If a numpy 'U' array used the FSR and consisted only of ASCII >> characters >> >> then it would have a one byte per char buffer. What then happens if >> you put >> >> a higher code point in? The buffer needs to be resized and the data >> copied >> >> over. But then what happens to any buffer objects or array views? They >> would >> >> be pointing at the old buffer from before the resize. Subsequent >> >> modifications to the resized array would not show up in other views >> and vice >> >> versa. >> >> >> >> I don't think that this can be done transparently since users of a >> numpy >> >> array need to know about the binary representation. That's why I >> suggest a >> >> dtype that has an encoding. Only in that way can it consistently have >> both a >> >> binary and a text interface. >> > >> > >> > I didn't say we should change the S type, but that we should have >> something, >> > say 's', that appeared to python as a string. I think if we want >> transparent >> > string interoperability with python together with a compressed >> > representation, and I think we need both, we are going to have to deal >> with >> > the difficulties of utf-8. That means raising errors if the string >> doesn't >> > fit in the allotted size, etc. Mind, this is a workaround for the mass >> of >> > ascii data that is already out there, not a substitute for 'U'. >> >> If we're going to be taking that much trouble, I'd suggest going ahead >> and adding a variable-length string type (where the array itself >> contains a pointer to a lookaside buffer, maybe with an optimization >> for stashing short strings directly). The fixed-length requirement is >> pretty onerous for lots of applications (e.g., pandas always uses >> dtype="O" for strings -- and that might be a good workaround for some >> people in this thread for now). The use of a lookaside buffer would >> also make it practical to resize the buffer when the maximum code >> point changed, for that matter... >> > The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From fhaxbox66 at googlemail.com Tue Jan 21 01:34:19 2014 From: fhaxbox66 at googlemail.com (Dr. Leo) Date: Tue, 21 Jan 2014 07:34:19 +0100 Subject: [Numpy-discussion] Creating an ndarray from an iterable over sequences In-Reply-To: References: Message-ID: <52DE14EB.8010305@gmail.com> Hi, I would like to write something like: In [25]: iterable=((i, i**2) for i in range(10)) In [26]: a=np.fromiter(iterable, int32) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () ----> 1 a=np.fromiter(iterable, int32) ValueError: setting an array element with a sequence. Is there an efficient way to do this? Creating two 1-dimensional arrays first is costly as one has to iterate twice over the data. So the only way I see is creating an empty [10,2] array and filling it row by row. This is memory-efficient but slow. List comprehension is vice versa. If there is no solution, wouldn't it be possible to rewrite fromiter so as to accept sequences? 
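A minimal sketch of the preallocate-and-fill fallback described above:

import numpy as np

# Sketch: preallocate the [10, 2] result and fill it row by row --
# memory-efficient, but the loop runs in the Python interpreter.
iterable = ((i, i**2) for i in range(10))
a = np.empty((10, 2), dtype=np.int32)
for row, pair in enumerate(iterable):
    a[row] = pair
print(a[:3])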
Leo From oscar.j.benjamin at gmail.com Tue Jan 21 06:13:36 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Tue, 21 Jan 2014 11:13:36 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120154039.GD2178@gmail.com> Message-ID: <20140121111334.GC2632@gmail.com> On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote: > On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris > On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: > >> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris wrote: > >> > > >> > I didn't say we should change the S type, but that we should have > >> something, > >> > say 's', that appeared to python as a string. I think if we want > >> transparent > >> > string interoperability with python together with a compressed > >> > representation, and I think we need both, we are going to have to deal > >> with > >> > the difficulties of utf-8. That means raising errors if the string > >> doesn't > >> > fit in the allotted size, etc. Mind, this is a workaround for the mass > >> of > >> > ascii data that is already out there, not a substitute for 'U'. > >> > >> If we're going to be taking that much trouble, I'd suggest going ahead > >> and adding a variable-length string type (where the array itself > >> contains a pointer to a lookaside buffer, maybe with an optimization > >> for stashing short strings directly). The fixed-length requirement is > >> pretty onerous for lots of applications (e.g., pandas always uses > >> dtype="O" for strings -- and that might be a good workaround for some > >> people in this thread for now). The use of a lookaside buffer would > >> also make it practical to resize the buffer when the maximum code > >> point changed, for that matter... > >> > The more I think about it, the more I think we may need to do that. Note > that dynd has ragged arrays and I think they are implemented as pointers to > buffers. The easy way for us to do that would be a specialization of object > arrays to string types only as you suggest. This wouldn't necessarily help for the gigarows of short text strings use case (depending on what "short" means). Also even if it technically saves memory you may have a greater overhead from fragmenting your array all over the heap. On my 64 bit Linux system the size of a Python 3.3 str containing only ASCII characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory saving over dtype='U' only if the strings are 17 characters or more. To get a 50% saving over dtype='U' you'd need strings of at least 49 characters. If the Numpy array would manage the buffers itself then that per string memory overhead would be eliminated in exchange for an 8 byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough - null bytes cannot be used). This gives an overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory if the strings are more than 3 characters long and you get at least a 50% saving for strings longer than 9 characters. Using utf-8 in the buffers eliminates the need to go around checking maximum code points etc. so I would guess that would be simpler to implement (CPython has now had to triple all of it's code paths that actually access the string buffer). Oscar From njs at pobox.com Tue Jan 21 06:41:30 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 21 Jan 2014 11:41:30 +0000 Subject: [Numpy-discussion] A one-byte string dtype? 
In-Reply-To: <20140121111334.GC2632@gmail.com> References: <20140120154039.GD2178@gmail.com> <20140121111334.GC2632@gmail.com> Message-ID: On 21 Jan 2014 11:13, "Oscar Benjamin" wrote: > If the Numpy array would manage the buffers itself then that per string memory > overhead would be eliminated in exchange for an 8 byte pointer and at least 1 > byte to represent the length of the string (assuming you can somehow use > Pascal strings when short enough - null bytes cannot be used). This gives an > overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory > if the strings are more than 3 characters long and you get at least a 50% > saving for strings longer than 9 characters. There are various optimisations possible as well. For ASCII strings of up to length 8, one could also use tagged pointers to eliminate the lookaside buffer entirely. (Alignment rules mean that pointers to allocated buffers always have the low bits zero; so you can make a rule that if the low bit is set to one, then this means the "pointer" itself should be interpreted as containing the string data; use the spare bit in the other bytes to encode the length.) In some cases it may also make sense to let identical strings share buffers, though this adds some overhead for reference counting and interning. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From e.antero.tammi at gmail.com Tue Jan 21 06:55:14 2014 From: e.antero.tammi at gmail.com (eat) Date: Tue, 21 Jan 2014 13:55:14 +0200 Subject: [Numpy-discussion] Creating an ndarray from an iterable over sequences In-Reply-To: <52DE14EB.8010305@gmail.com> References: <52DE14EB.8010305@gmail.com> Message-ID: Hi, On Tue, Jan 21, 2014 at 8:34 AM, Dr. Leo wrote: > Hi, > > I would like to write something like: > > In [25]: iterable=((i, i**2) for i in range(10)) > > In [26]: a=np.fromiter(iterable, int32) > --------------------------------------------------------------------------- > ValueError Traceback (most recent call > last) > in () > ----> 1 a=np.fromiter(iterable, int32) > > ValueError: setting an array element with a sequence. > > > Is there an efficient way to do this? > Perhaps you could just utilize structured arrays ( http://docs.scipy.org/doc/numpy/user/basics.rec.html), like: iterable= ((i, i**2) for i in range(10)) a= np.fromiter(iterable, [('a', int32), ('b', int32)], 10) a.view(int32).reshape(-1, 2) Out[]: array([[ 0, 0], [ 1, 1], [ 2, 4], [ 3, 9], [ 4, 16], [ 5, 25], [ 6, 36], [ 7, 49], [ 8, 64], [ 9, 81]]) My 2 cents, -eat > > Creating two 1-dimensional arrays first is costly as one has to > iterate twice over the data. So the only way I see is creating an > empty [10,2] array and filling it row by row. This is memory-efficient > but slow. List comprehension is vice versa. > > If there is no solution, wouldn't it be possible to rewrite fromiter > so as to accept sequences? > > Leo > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oscar.j.benjamin at gmail.com Tue Jan 21 07:09:39 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Tue, 21 Jan 2014 12:09:39 +0000 Subject: [Numpy-discussion] Creating an ndarray from an iterable over sequences In-Reply-To: <52DE14EB.8010305@gmail.com> References: <52DE14EB.8010305@gmail.com> Message-ID: <20140121120937.GE2632@gmail.com> On Tue, Jan 21, 2014 at 07:34:19AM +0100, Dr. Leo wrote: > Hi, > > I would like to write something like: > > In [25]: iterable=((i, i**2) for i in range(10)) > > In [26]: a=np.fromiter(iterable, int32) > --------------------------------------------------------------------------- > ValueError Traceback (most recent call > last) > in () > ----> 1 a=np.fromiter(iterable, int32) > > ValueError: setting an array element with a sequence. > > > Is there an efficient way to do this? > > Creating two 1-dimensional arrays first is costly as one has to > iterate twice over the data. So the only way I see is creating an > empty [10,2] array and filling it row by row. This is memory-efficient > but slow. List comprehension is vice versa. You could use itertools: >>> from itertools import chain >>> g = ((i, i**2) for i in range(10)) >>> import numpy >>> numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2) array([[ 0, 0], [ 1, 1], [ 2, 4], [ 3, 9], [ 4, 16], [ 5, 25], [ 6, 36], [ 7, 49], [ 8, 64], [ 9, 81]], dtype=int32) Oscar From oscar.j.benjamin at gmail.com Tue Jan 21 07:30:08 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Tue, 21 Jan 2014 12:30:08 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140121111334.GC2632@gmail.com> Message-ID: <20140121123006.GF2632@gmail.com> On Tue, Jan 21, 2014 at 11:41:30AM +0000, Nathaniel Smith wrote: > On 21 Jan 2014 11:13, "Oscar Benjamin" wrote: > > If the Numpy array would manage the buffers itself then that per string > memory > > overhead would be eliminated in exchange for an 8 byte pointer and at > least 1 > > byte to represent the length of the string (assuming you can somehow use > > Pascal strings when short enough - null bytes cannot be used). This gives > an > > overhead of 9 bytes per string (or 5 on 32 bit). In this case you save > memory > > if the strings are more than 3 characters long and you get at least a 50% > > saving for strings longer than 9 characters. > > There are various optimisations possible as well. > > For ASCII strings of up to length 8, one could also use tagged pointers to > eliminate the lookaside buffer entirely. (Alignment rules mean that > pointers to allocated buffers always have the low bits zero; so you can > make a rule that if the low bit is set to one, then this means the > "pointer" itself should be interpreted as containing the string data; use > the spare bit in the other bytes to encode the length.) > > In some cases it may also make sense to let identical strings share > buffers, though this adds some overhead for reference counting and > interning. Would this new dtype have an opaque memory representation? What would happen in the following: >>> a = numpy.array(['CGA', 'GAT'], dtype='s') >>> memoryview(a) >>> with open('file', 'wb') as fout: ... a.tofile(fout) >>> with open('file', 'rb') as fin: ... a = numpy.fromfile(fin, dtype='s') Should there be a different function for creating such an array from reading a text file? Or would you just need to use fromiter: >>> with open('file', encoding='utf-8') as fin: ... 
a = numpy.fromiter(fin, dtype='s') >>> with open('file', encoding='utf-8') as fout: ... fout.writelines(line + '\n' for line in a) (Note that the above would not be reversible if the strings contain newlines) I think it Would be less confusing to use dtype='u' than dtype='U' in order to signify that it is an optimised form of the 'U' dtype as far as access from Python code is concerned? Calling it 's' only really makes sense if there is a plan to deprecate dtype='S'. How would it behave in Python 2? Would it return unicode strings there as well? Oscar From aldcroft at head.cfa.harvard.edu Tue Jan 21 07:54:21 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Tue, 21 Jan 2014 07:54:21 -0500 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris wrote: > > > > On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> >> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: >> >>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris >>> wrote: >>> > >>> > >>> > >>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin < >>> oscar.j.benjamin at gmail.com> >>> > wrote: >>> >> >>> >> >>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" < >>> charlesr.harris at gmail.com> >>> >> wrote: >>> >> > >>> >> > I think we may want something like PEP 393. The S datatype may be >>> the >>> >> > wrong place to look, we might want a modification of U instead so >>> as to >>> >> > transparently get the benefit of python strings. >>> >> >>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than >>> it >>> >> does for numpy arrays for two reasons: str is immutable and opaque. >>> >> >>> >> Since str is immutable the maximum code point in the string can be >>> >> determined once when the string is created before anything else can >>> get a >>> >> pointer to the string buffer. >>> >> >>> >> Since it is opaque no one can rightly expect it to expose a particular >>> >> binary format so it is free to choose without compromising any >>> expected >>> >> semantics. >>> >> >>> >> If someone can call buffer on an array then the FSR is a semantic >>> change. >>> >> >>> >> If a numpy 'U' array used the FSR and consisted only of ASCII >>> characters >>> >> then it would have a one byte per char buffer. What then happens if >>> you put >>> >> a higher code point in? The buffer needs to be resized and the data >>> copied >>> >> over. But then what happens to any buffer objects or array views? >>> They would >>> >> be pointing at the old buffer from before the resize. Subsequent >>> >> modifications to the resized array would not show up in other views >>> and vice >>> >> versa. >>> >> >>> >> I don't think that this can be done transparently since users of a >>> numpy >>> >> array need to know about the binary representation. That's why I >>> suggest a >>> >> dtype that has an encoding. Only in that way can it consistently have >>> both a >>> >> binary and a text interface. >>> > >>> > >>> > I didn't say we should change the S type, but that we should have >>> something, >>> > say 's', that appeared to python as a string. I think if we want >>> transparent >>> > string interoperability with python together with a compressed >>> > representation, and I think we need both, we are going to have to deal >>> with >>> > the difficulties of utf-8. 
That means raising errors if the string >>> doesn't >>> > fit in the allotted size, etc. Mind, this is a workaround for the mass >>> of >>> > ascii data that is already out there, not a substitute for 'U'. >>> >>> If we're going to be taking that much trouble, I'd suggest going ahead >>> and adding a variable-length string type (where the array itself >>> contains a pointer to a lookaside buffer, maybe with an optimization >>> for stashing short strings directly). The fixed-length requirement is >>> pretty onerous for lots of applications (e.g., pandas always uses >>> dtype="O" for strings -- and that might be a good workaround for some >>> people in this thread for now). The use of a lookaside buffer would >>> also make it practical to resize the buffer when the maximum code >>> point changed, for that matter... >>> >> > The more I think about it, the more I think we may need to do that. Note > that dynd has ragged arrays and I think they are implemented as pointers to > buffers. The easy way for us to do that would be a specialization of object > arrays to string types only as you suggest. > Is this approach intended to be in *addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that? - Tom > > > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jan 21 08:55:29 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jan 2014 06:55:29 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> >> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> >>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: >>> >>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris >>>> wrote: >>>> > >>>> > >>>> > >>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin < >>>> oscar.j.benjamin at gmail.com> >>>> > wrote: >>>> >> >>>> >> >>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" < >>>> charlesr.harris at gmail.com> >>>> >> wrote: >>>> >> > >>>> >> > I think we may want something like PEP 393. The S datatype may be >>>> the >>>> >> > wrong place to look, we might want a modification of U instead so >>>> as to >>>> >> > transparently get the benefit of python strings. >>>> >> >>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str >>>> than it >>>> >> does for numpy arrays for two reasons: str is immutable and opaque. >>>> >> >>>> >> Since str is immutable the maximum code point in the string can be >>>> >> determined once when the string is created before anything else can >>>> get a >>>> >> pointer to the string buffer. >>>> >> >>>> >> Since it is opaque no one can rightly expect it to expose a >>>> particular >>>> >> binary format so it is free to choose without compromising any >>>> expected >>>> >> semantics. >>>> >> >>>> >> If someone can call buffer on an array then the FSR is a semantic >>>> change. 
>>>> >> >>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII >>>> characters >>>> >> then it would have a one byte per char buffer. What then happens if >>>> you put >>>> >> a higher code point in? The buffer needs to be resized and the data >>>> copied >>>> >> over. But then what happens to any buffer objects or array views? >>>> They would >>>> >> be pointing at the old buffer from before the resize. Subsequent >>>> >> modifications to the resized array would not show up in other views >>>> and vice >>>> >> versa. >>>> >> >>>> >> I don't think that this can be done transparently since users of a >>>> numpy >>>> >> array need to know about the binary representation. That's why I >>>> suggest a >>>> >> dtype that has an encoding. Only in that way can it consistently >>>> have both a >>>> >> binary and a text interface. >>>> > >>>> > >>>> > I didn't say we should change the S type, but that we should have >>>> something, >>>> > say 's', that appeared to python as a string. I think if we want >>>> transparent >>>> > string interoperability with python together with a compressed >>>> > representation, and I think we need both, we are going to have to >>>> deal with >>>> > the difficulties of utf-8. That means raising errors if the string >>>> doesn't >>>> > fit in the allotted size, etc. Mind, this is a workaround for the >>>> mass of >>>> > ascii data that is already out there, not a substitute for 'U'. >>>> >>>> If we're going to be taking that much trouble, I'd suggest going ahead >>>> and adding a variable-length string type (where the array itself >>>> contains a pointer to a lookaside buffer, maybe with an optimization >>>> for stashing short strings directly). The fixed-length requirement is >>>> pretty onerous for lots of applications (e.g., pandas always uses >>>> dtype="O" for strings -- and that might be a good workaround for some >>>> people in this thread for now). The use of a lookaside buffer would >>>> also make it practical to resize the buffer when the maximum code >>>> point changed, for that matter... >>>> >>> >> The more I think about it, the more I think we may need to do that. Note >> that dynd has ragged arrays and I think they are implemented as pointers to >> buffers. The easy way for us to do that would be a specialization of object >> arrays to string types only as you suggest. >> > > Is this approach intended to be in *addition to* the latin-1 "s" type > originally proposed by Chris, or *instead of* that? > > Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Tue Jan 21 09:37:11 2014 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Tue, 21 Jan 2014 09:37:11 -0500 Subject: [Numpy-discussion] A one-byte string dtype? 
In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris wrote: > > > > On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas < > aldcroft at head.cfa.harvard.edu> wrote: > >> >> >> >> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> >>> >>> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris < >>> charlesr.harris at gmail.com> wrote: >>> >>>> >>>> >>>> >>>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: >>>> >>>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris >>>>> wrote: >>>>> > >>>>> > >>>>> > >>>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin < >>>>> oscar.j.benjamin at gmail.com> >>>>> > wrote: >>>>> >> >>>>> >> >>>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" < >>>>> charlesr.harris at gmail.com> >>>>> >> wrote: >>>>> >> > >>>>> >> > I think we may want something like PEP 393. The S datatype may be >>>>> the >>>>> >> > wrong place to look, we might want a modification of U instead so >>>>> as to >>>>> >> > transparently get the benefit of python strings. >>>>> >> >>>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str >>>>> than it >>>>> >> does for numpy arrays for two reasons: str is immutable and opaque. >>>>> >> >>>>> >> Since str is immutable the maximum code point in the string can be >>>>> >> determined once when the string is created before anything else can >>>>> get a >>>>> >> pointer to the string buffer. >>>>> >> >>>>> >> Since it is opaque no one can rightly expect it to expose a >>>>> particular >>>>> >> binary format so it is free to choose without compromising any >>>>> expected >>>>> >> semantics. >>>>> >> >>>>> >> If someone can call buffer on an array then the FSR is a semantic >>>>> change. >>>>> >> >>>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII >>>>> characters >>>>> >> then it would have a one byte per char buffer. What then happens if >>>>> you put >>>>> >> a higher code point in? The buffer needs to be resized and the data >>>>> copied >>>>> >> over. But then what happens to any buffer objects or array views? >>>>> They would >>>>> >> be pointing at the old buffer from before the resize. Subsequent >>>>> >> modifications to the resized array would not show up in other views >>>>> and vice >>>>> >> versa. >>>>> >> >>>>> >> I don't think that this can be done transparently since users of a >>>>> numpy >>>>> >> array need to know about the binary representation. That's why I >>>>> suggest a >>>>> >> dtype that has an encoding. Only in that way can it consistently >>>>> have both a >>>>> >> binary and a text interface. >>>>> > >>>>> > >>>>> > I didn't say we should change the S type, but that we should have >>>>> something, >>>>> > say 's', that appeared to python as a string. I think if we want >>>>> transparent >>>>> > string interoperability with python together with a compressed >>>>> > representation, and I think we need both, we are going to have to >>>>> deal with >>>>> > the difficulties of utf-8. That means raising errors if the string >>>>> doesn't >>>>> > fit in the allotted size, etc. Mind, this is a workaround for the >>>>> mass of >>>>> > ascii data that is already out there, not a substitute for 'U'. 
>>>>> >>>>> If we're going to be taking that much trouble, I'd suggest going ahead >>>>> and adding a variable-length string type (where the array itself >>>>> contains a pointer to a lookaside buffer, maybe with an optimization >>>>> for stashing short strings directly). The fixed-length requirement is >>>>> pretty onerous for lots of applications (e.g., pandas always uses >>>>> dtype="O" for strings -- and that might be a good workaround for some >>>>> people in this thread for now). The use of a lookaside buffer would >>>>> also make it practical to resize the buffer when the maximum code >>>>> point changed, for that matter... >>>>> >>>> >>> The more I think about it, the more I think we may need to do that. Note >>> that dynd has ragged arrays and I think they are implemented as pointers to >>> buffers. The easy way for us to do that would be a specialization of object >>> arrays to string types only as you suggest. >>> >> >> Is this approach intended to be in *addition to* the latin-1 "s" type >> originally proposed by Chris, or *instead of* that? >> >> > Well, that's open for discussion. The problem is to have something that is > both compact (latin-1) and interoperates transparently with python 3 > strings (utf-8). A latin-1 type would be easier to implement and would > probably be a better choice for something available in both python 2 and > python 3, but unless the python 3 developers come up with something clever > I don't see how to make it behave transparently as a string in python 3. > OTOH, it's not clear to me how to make utf-8 operate transparently with > python 2 strings, especially as the unicode representation choices in > python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 > is unlikely to be backported. The problem may be unsolvable in a completely > satisfactory way. > Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability. This would solve my use-case (giga-rows of short fixed length strings), and presumably allow things like memory mapping of large data files (like for FITS files in astropy.io.fits). I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3 strings. Is there a technical problem with doing basically the same thing for an 's' dtype, but using latin-1 instead of UCS-4? Thanks, Tom > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Tue Jan 21 09:43:31 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Tue, 21 Jan 2014 14:43:31 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: <20140121144329.GH2632@gmail.com> On Tue, Jan 21, 2014 at 06:55:29AM -0700, Charles R Harris wrote: > > Well, that's open for discussion. The problem is to have something that is > both compact (latin-1) and interoperates transparently with python 3 > strings (utf-8). A latin-1 type would be easier to implement and would > probably be a better choice for something available in both python 2 and > python 3, but unless the python 3 developers come up with something clever > I don't see how to make it behave transparently as a string in python 3. 
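To make the latin-1 trade-off concrete, a small sketch (with 'S' standing in
for the proposed one-byte 's'): every byte value round-trips through latin-1
without error, and storage drops to a quarter of the 'U' dtype for short
strings:

import numpy as np

# Latin-1 maps every byte value 0-255 to exactly one code point, so it
# never raises and round-trips arbitrary bytes.
raw = bytes(bytearray(range(256)))
assert raw.decode('latin-1').encode('latin-1') == raw

# Memory for short fixed-width strings: 4 bytes/char for 'U' versus
# 1 byte/char for a one-byte dtype ('S' used here as a stand-in).
u = np.array([u'NGC1275'] * 1000, dtype='U7')
s = np.array([b'NGC1275'] * 1000, dtype='S7')
print(u.nbytes, s.nbytes)   # 28000 vs 7000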
> OTOH, it's not clear to me how to make utf-8 operate transparently with > python 2 strings, especially as the unicode representation choices in > python 2 are ucs-2 or ucs-4 On Python 2, unicode strings can operate transparently with byte strings: $ python Python 2.7.3 (default, Sep 26 2013, 20:03:06) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import numpy as bnp >>> import numpy as np >>> a = np.array([u'\xd5scar'], dtype='U') >>> a array([u'\xd5scar'], dtype='>> a[0] u'\xd5scar' >>> import sys >>> sys.stdout.encoding 'UTF-8' >>> print(a[0]) # Encodes as 'utf-8' ?scar >>> 'My name is %s' % a[0] # Decodes as ASCII u'My name is \xd5scar' >>> print('My name is %s' % a[0]) # Encodes as UTF-8 My name is ?scar This is no better worse than the rest of the Py2 text model. So if the new dtype always returns a unicode string under Py2 it should work (as well as the Py2 text model ever does). > and the python 3 work adding utf-16 and utf-8 > is unlikely to be backported. The problem may be unsolvable in a completely > satisfactory way. What do you mean by this? PEP 393 uses UCS-1/2/4 not utf-8/16/32 i.e. it always uses a fixed-width encoding. You can just use the CPython C-API to create the unicode strings. The simplest way is probably use utf-8 internally and then call PyUnicode_DecodeUTF8 and PyUnicode_EncodeUTF8 at the boundaries. This should work fine on Python 2.x and 3.x. It obviates any need to think about pre-3.3 narrow and wide builds and post-3.3 FSR formats. Unlike Python's str there isn't much need to be able to efficiently slice or index within the string array element. Indexing into the array to get the string requires creating a new object, so you may as well just decode from utf-8 at that point [it's big-O(num chars) either way]. There's no need to constrain it to fixed-width encodings like the FSR in which case utf-8 is clearly the best choice as: 1) It covers the whole unicode spectrum. 2) It uses 1 byte-per-char for ASCII. 3) UTF-8 is a big optimisation target for CPython (so it's fast). Oscar From charlesr.harris at gmail.com Tue Jan 21 09:48:11 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jan 2014 07:48:11 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > > > On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> >> >> >> On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas < >> aldcroft at head.cfa.harvard.edu> wrote: >> >>> >>> >>> >>> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris < >>> charlesr.harris at gmail.com> wrote: >>> >>>> >>>> >>>> >>>> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris < >>>> charlesr.harris at gmail.com> wrote: >>>> >>>>> >>>>> >>>>> >>>>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote: >>>>> >>>>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris >>>>>> wrote: >>>>>> > >>>>>> > >>>>>> > >>>>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin < >>>>>> oscar.j.benjamin at gmail.com> >>>>>> > wrote: >>>>>> >> >>>>>> >> >>>>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" < >>>>>> charlesr.harris at gmail.com> >>>>>> >> wrote: >>>>>> >> > >>>>>> >> > I think we may want something like PEP 393. 
The S datatype may >>>>>> be the >>>>>> >> > wrong place to look, we might want a modification of U instead >>>>>> so as to >>>>>> >> > transparently get the benefit of python strings. >>>>>> >> >>>>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str >>>>>> than it >>>>>> >> does for numpy arrays for two reasons: str is immutable and opaque. >>>>>> >> >>>>>> >> Since str is immutable the maximum code point in the string can be >>>>>> >> determined once when the string is created before anything else >>>>>> can get a >>>>>> >> pointer to the string buffer. >>>>>> >> >>>>>> >> Since it is opaque no one can rightly expect it to expose a >>>>>> particular >>>>>> >> binary format so it is free to choose without compromising any >>>>>> expected >>>>>> >> semantics. >>>>>> >> >>>>>> >> If someone can call buffer on an array then the FSR is a semantic >>>>>> change. >>>>>> >> >>>>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII >>>>>> characters >>>>>> >> then it would have a one byte per char buffer. What then happens >>>>>> if you put >>>>>> >> a higher code point in? The buffer needs to be resized and the >>>>>> data copied >>>>>> >> over. But then what happens to any buffer objects or array views? >>>>>> They would >>>>>> >> be pointing at the old buffer from before the resize. Subsequent >>>>>> >> modifications to the resized array would not show up in other >>>>>> views and vice >>>>>> >> versa. >>>>>> >> >>>>>> >> I don't think that this can be done transparently since users of a >>>>>> numpy >>>>>> >> array need to know about the binary representation. That's why I >>>>>> suggest a >>>>>> >> dtype that has an encoding. Only in that way can it consistently >>>>>> have both a >>>>>> >> binary and a text interface. >>>>>> > >>>>>> > >>>>>> > I didn't say we should change the S type, but that we should have >>>>>> something, >>>>>> > say 's', that appeared to python as a string. I think if we want >>>>>> transparent >>>>>> > string interoperability with python together with a compressed >>>>>> > representation, and I think we need both, we are going to have to >>>>>> deal with >>>>>> > the difficulties of utf-8. That means raising errors if the string >>>>>> doesn't >>>>>> > fit in the allotted size, etc. Mind, this is a workaround for the >>>>>> mass of >>>>>> > ascii data that is already out there, not a substitute for 'U'. >>>>>> >>>>>> If we're going to be taking that much trouble, I'd suggest going ahead >>>>>> and adding a variable-length string type (where the array itself >>>>>> contains a pointer to a lookaside buffer, maybe with an optimization >>>>>> for stashing short strings directly). The fixed-length requirement is >>>>>> pretty onerous for lots of applications (e.g., pandas always uses >>>>>> dtype="O" for strings -- and that might be a good workaround for some >>>>>> people in this thread for now). The use of a lookaside buffer would >>>>>> also make it practical to resize the buffer when the maximum code >>>>>> point changed, for that matter... >>>>>> >>>>> >>>> The more I think about it, the more I think we may need to do that. >>>> Note that dynd has ragged arrays and I think they are implemented as >>>> pointers to buffers. The easy way for us to do that would be a >>>> specialization of object arrays to string types only as you suggest. >>>> >>> >>> Is this approach intended to be in *addition to* the latin-1 "s" type >>> originally proposed by Chris, or *instead of* that? >>> >>> >> Well, that's open for discussion. 
The problem is to have something that >> is both compact (latin-1) and interoperates transparently with python 3 >> strings (utf-8). A latin-1 type would be easier to implement and would >> probably be a better choice for something available in both python 2 and >> python 3, but unless the python 3 developers come up with something clever >> I don't see how to make it behave transparently as a string in python 3. >> OTOH, it's not clear to me how to make utf-8 operate transparently with >> python 2 strings, especially as the unicode representation choices in >> python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 >> is unlikely to be backported. The problem may be unsolvable in a completely >> satisfactory way. >> > > Since it's open for discussion, I'll put in my vote for implementing the > easier latin-1 version in the short term to facilitate Python 2 / 3 > interoperability. This would solve my use-case (giga-rows of short fixed > length strings), and presumably allow things like memory mapping of large > data files (like for FITS files in astropy.io.fits). > > I don't have a clue how the current 'U' dtype works under the hood, but > from my user perspective it seems to work just fine in terms of interacting > with Python 3 strings. Is there a technical problem with doing basically > the same thing for an 's' dtype, but using latin-1 instead of UCS-4? > I think there is a technical problem. We may be able masquerade latin-1 as utf-8 for some subset of characters or fool python 3 in some other way. But in anycase, I think it needs some research to see what the possibilities are. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Jan 21 10:10:01 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 21 Jan 2014 16:10:01 +0100 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: <20140120101113.GA2178@gmail.com> <20140120154039.GD2178@gmail.com> Message-ID: <1390317001.25697.7.camel@sebastian-laptop> On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote: > > > > On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas > wrote: > > > > On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris > wrote: > > > > On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas > wrote: > > > > On Mon, Jan 20, 2014 at 6:12 PM, Charles R > Harris wrote: > > > > On Mon, Jan 20, 2014 at 3:58 PM, > Charles R Harris > wrote: > > > > On Mon, Jan 20, 2014 at 3:35 > PM, Nathaniel Smith > wrote: > On Mon, Jan 20, 2014 > at 10:28 PM, Charles R > Harris > wrote: > > > > > > > > On Mon, Jan 20, 2014 > at 2:27 PM, Oscar > Benjamin > > > wrote: > >> > >> > >> On Jan 20, 2014 > 8:35 PM, "Charles R > Harris" > > >> wrote: > >> > > >> > I think we may > want something like > PEP 393. The S > datatype may be the > >> > wrong place to > look, we might want a > modification of U > instead so as to > >> > transparently get > the benefit of python > strings. > >> > >> The approach taken > in PEP 393 (the FSR) > makes more sense for > str than it > >> does for numpy > arrays for two > reasons: str is > immutable and opaque. > >> > >> Since str is > immutable the maximum > code point in the > string can be > >> determined once > when the string is > created before > anything else can get > a > >> pointer to the > string buffer. > >> > >> Since it is opaque > no one can rightly > expect it to expose a > particular > >> binary format so it > is free to choose > without compromising > any expected > >> semantics. 
> >> > >> If someone can call > buffer on an array > then the FSR is a > semantic change. > >> > >> If a numpy 'U' > array used the FSR and > consisted only of > ASCII characters > >> then it would have > a one byte per char > buffer. What then > happens if you put > >> a higher code point > in? The buffer needs > to be resized and the > data copied > >> over. But then what > happens to any buffer > objects or array > views? They would > >> be pointing at the > old buffer from before > the resize. Subsequent > >> modifications to > the resized array > would not show up in > other views and vice > >> versa. > >> > >> I don't think that > this can be done > transparently since > users of a numpy > >> array need to know > about the binary > representation. That's > why I suggest a > >> dtype that has an > encoding. Only in that > way can it > consistently have both > a > >> binary and a text > interface. > > > > > > I didn't say we > should change the S > type, but that we > should have something, > > say 's', that > appeared to python as > a string. I think if > we want transparent > > string > interoperability with > python together with a > compressed > > representation, and > I think we need both, > we are going to have > to deal with > > the difficulties of > utf-8. That means > raising errors if the > string doesn't > > fit in the allotted > size, etc. Mind, this > is a workaround for > the mass of > > ascii data that is > already out there, not > a substitute for 'U'. > > > If we're going to be > taking that much > trouble, I'd suggest > going ahead > and adding a > variable-length string > type (where the array > itself > contains a pointer to > a lookaside buffer, > maybe with an > optimization > for stashing short > strings directly). The > fixed-length > requirement is > pretty onerous for > lots of applications > (e.g., pandas always > uses > dtype="O" for strings > -- and that might be a > good workaround for > some > people in this thread > for now). The use of a > lookaside buffer would > also make it practical > to resize the buffer > when the maximum code > point changed, for > that matter... > > > The more I think about it, the more I > think we may need to do that. Note > that dynd has ragged arrays and I > think they are implemented as pointers > to buffers. The easy way for us to do > that would be a specialization of > object arrays to string types only as > you suggest. > > > > Is this approach intended to be in *addition > to* the latin-1 "s" type originally proposed > by Chris, or *instead of* that? > > > > > Well, that's open for discussion. The problem is to > have something that is both compact (latin-1) and > interoperates transparently with python 3 strings > (utf-8). A latin-1 type would be easier to implement > and would probably be a better choice for something > available in both python 2 and python 3, but unless > the python 3 developers come up with something clever > I don't see how to make it behave transparently as a > string in python 3. OTOH, it's not clear to me how to > make utf-8 operate transparently with python 2 > strings, especially as the unicode representation > choices in python 2 are ucs-2 or ucs-4 and the python > 3 work adding utf-16 and utf-8 is unlikely to be > backported. The problem may be unsolvable in a > completely satisfactory way. > > > > Since it's open for discussion, I'll put in my vote for > implementing the easier latin-1 version in the short term to > facilitate Python 2 / 3 interoperability. 
This would solve my > use-case (giga-rows of short fixed length strings), and > presumably allow things like memory mapping of large data > files (like for FITS files in astropy.io.fits). > > > I don't have a clue how the current 'U' dtype works under the > hood, but from my user perspective it seems to work just fine > in terms of interacting with Python 3 strings. Is there a > technical problem with doing basically the same thing for an > 's' dtype, but using latin-1 instead of UCS-4? > > > I think there is a technical problem. We may be able masquerade > latin-1 as utf-8 for some subset of characters or fool python 3 in > some other way. But in anycase, I think it needs some research to see > what the possibilities are. > I am not quite sure, but shouldn't it be even possible to tag on a possible encoding into the metadata of the string dtype and allow this to be set to all 1-byte wide encodings that python understands. If the metadata is not None, all entry points to and from the array (Object->string, string->Object conversions) would then de- or encode using the usual python string de- and encode. Of course it would still be a lot of work, since the string comparisons would need to know about comparing different encodings and dtype equivalence is wrong and all the conversions need to be carefully checked... Most string tools though probably don't care about encoding as long as it is fixed 1-byte width, though one would have to check that they don't lose the encoding information by creating a new "S" array... - Sebastian > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jenny.stone125 at gmail.com Tue Jan 21 11:26:17 2014 From: jenny.stone125 at gmail.com (jennifer stone) Date: Tue, 21 Jan 2014 21:56:17 +0530 Subject: [Numpy-discussion] (no subject) Message-ID: > >What are your interests and experience? If you use numpy, are there things > >you would like to fix, or enhancements you would like to see? > > Chuck > > I am an undergraduate student with CS as major and have interest in Math and Physics. This has led me to use NumPy and SciPy to work on innumerable cases involving special polynomial functions and polynomials like Legendre polynomials, Bessel Functions and so on. So, The packages are closer known to me from this point of view. I have a* few proposals* in mind. But I don't have any idea if they are acceptable within the scope of GSoC 1. Many special functions and polynomials are neither included in NumPy nor on SciPy.. These include Ellipsoidal Harmonic Functions (lames function), Cylindrical Harmonic function. Scipy at present supports only spherical Harmonic function. Further, why cant we extend SciPy to incorporate* Inverse Laplace Transforms*? At present Matlab has this amazing function *ilaplace* and SymPy does have *Inverse_Laplace_transform* but it would be better to incorporate all in one package. I mean SciPy does have function to evaluate laplace transform After having written this, I feel that this post should have been sent to SciPy but as a majority of contributors are the same I proceed. Please suggest any other possible projects, as I would like to continue with SciPy or NumPy, preferably NumPy as I have been fiddling with its source code for a month now and so am pretty comfortable with it. As for my experience, I have known C for past 4 years and have been a python lover for past 1 year. 
I am pretty new to open source communities, started before a manth and a half. regards Jennifer -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan at sun.ac.za Tue Jan 21 11:45:31 2014 From: stefan at sun.ac.za (=?iso-8859-1?Q?St=E9fan?= van der Walt) Date: Tue, 21 Jan 2014 17:45:31 +0100 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: <20140121164531.GB21126@gmail.com> On Tue, 21 Jan 2014 21:56:17 +0530, jennifer stone wrote: > I am an undergraduate student with CS as major and have interest in Math > and Physics. This has led me to use NumPy and SciPy to work on innumerable > cases involving special polynomial functions and polynomials like Legendre > polynomials, Bessel Functions and so on. So, The packages are closer known > to me from this point of view. I have a* few proposals* in mind. But I > don't have any idea if they are acceptable within the scope of GSoC > 1. Many special functions and polynomials are neither included in NumPy nor > on SciPy.. These include Ellipsoidal Harmonic Functions (lames function), > Cylindrical Harmonic function. Scipy at present supports only spherical > Harmonic function. SciPy's spherical harmonics are very inefficient if one is only interested in computing one specific order. I'd be so happy if someone would work on that! St?fan From charlesr.harris at gmail.com Tue Jan 21 11:46:36 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jan 2014 09:46:36 -0700 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: On Tue, Jan 21, 2014 at 9:26 AM, jennifer stone wrote: > > >What are your interests and experience? If you use numpy, are there things >> >you would like to fix, or enhancements you would like to see? >> >> Chuck >> >> > I am an undergraduate student with CS as major and have interest in Math > and Physics. This has led me to use NumPy and SciPy to work on innumerable > cases involving special polynomial functions and polynomials like Legendre > polynomials, Bessel Functions and so on. So, The packages are closer known > to me from this point of view. I have a* few proposals* in mind. But I > don't have any idea if they are acceptable within the scope of GSoC > 1. Many special functions and polynomials are neither included in NumPy > nor on SciPy.. These include Ellipsoidal Harmonic Functions (lames > function), Cylindrical Harmonic function. Scipy at present supports only > spherical Harmonic function. > Further, why cant we extend SciPy to incorporate* Inverse Laplace > Transforms*? At present Matlab has this amazing function *ilaplace* and > SymPy does have *Inverse_Laplace_transform* but it would be better to > incorporate all in one package. I mean SciPy does have function to evaluate > laplace transform > > After having written this, I feel that this post should have been sent to > SciPy > but as a majority of contributors are the same I proceed. > Please suggest any other possible projects, as I would like to continue > with SciPy or NumPy, preferably NumPy as I have been fiddling with its > source code for a month now and so am pretty comfortable with it. > > As for my experience, I have known C for past 4 years and have been a > python lover for past 1 year. I am pretty new to open source communities, > started before a manth and a half. > > It does sound like scipy might be a better match, I don't think anyone would complain if you cross posted. 
Both scipy and numpy require GSOC candidates to have a pull request accepted as part of the application process. I'd suggest implementing a function not currently in scipy that you think would be useful. That would also help in finding a mentor for the summer. I'd also suggest getting familiar with cython. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jan 21 12:03:02 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jan 2014 10:03:02 -0700 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: On Tue, Jan 21, 2014 at 9:46 AM, Charles R Harris wrote: > > > > On Tue, Jan 21, 2014 at 9:26 AM, jennifer stone wrote: > >> >> >What are your interests and experience? If you use numpy, are there >>> things >>> >you would like to fix, or enhancements you would like to see? >>> >>> Chuck >>> >>> >> I am an undergraduate student with CS as major and have interest in Math >> and Physics. This has led me to use NumPy and SciPy to work on innumerable >> cases involving special polynomial functions and polynomials like Legendre >> polynomials, Bessel Functions and so on. So, The packages are closer known >> to me from this point of view. I have a* few proposals* in mind. But I >> don't have any idea if they are acceptable within the scope of GSoC >> 1. Many special functions and polynomials are neither included in NumPy >> nor on SciPy.. These include Ellipsoidal Harmonic Functions (lames >> function), Cylindrical Harmonic function. Scipy at present supports only >> spherical Harmonic function. >> Further, why cant we extend SciPy to incorporate* Inverse Laplace >> Transforms*? At present Matlab has this amazing function *ilaplace* and >> SymPy does have *Inverse_Laplace_transform* but it would be better to >> incorporate all in one package. I mean SciPy does have function to evaluate >> laplace transform >> >> After having written this, I feel that this post should have been sent to >> SciPy >> but as a majority of contributors are the same I proceed. >> Please suggest any other possible projects, as I would like to continue >> with SciPy or NumPy, preferably NumPy as I have been fiddling with its >> source code for a month now and so am pretty comfortable with it. >> >> As for my experience, I have known C for past 4 years and have been a >> python lover for past 1 year. I am pretty new to open source communities, >> started before a manth and a half. >> >> > It does sound like scipy might be a better match, I don't think anyone > would complain if you cross posted. Both scipy and numpy require GSOC > candidates to have a pull request accepted as part of the application > process. I'd suggest implementing a function not currently in scipy that > you think would be useful. That would also help in finding a mentor for the > summer. I'd also suggest getting familiar with cython. > > I don't see you on github yet, are you there? If not, you should set up an account to work in. See the developer guide for some pointers. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Tue Jan 21 12:28:19 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Tue, 21 Jan 2014 09:28:19 -0800 Subject: [Numpy-discussion] A one-byte string dtype? Message-ID: Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant it's own home on the Wiki? 
DG -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Jan 21 12:35:26 2014 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 21 Jan 2014 17:35:26 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: On 21 Jan 2014 17:28, "David Goldsmith" wrote: > > > Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant it's own home on the Wiki? Sounds plausible, perhaps you could write up such a page? -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Jan 21 12:46:41 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 21 Jan 2014 09:46:41 -0800 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith wrote: > > Am I the only one who feels that this (very important--I'm being sincere, > not sarcastic) thread has matured and specialized enough to warrant it's > own home on the Wiki? > Or maybe a NEP? https://github.com/numpy/numpy/tree/master/doc/neps sorry -- really swamped this week, so I won't be writing it... -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Tue Jan 21 12:53:25 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Tue, 21 Jan 2014 09:53:25 -0800 Subject: [Numpy-discussion] A one-byte string dtype? Message-ID: > Date: Tue, 21 Jan 2014 17:35:26 +0000 > From: Nathaniel Smith > Subject: Re: [Numpy-discussion] A one-byte string dtype? > To: Discussion of Numerical Python > Message-ID: > KE3xLGa2+gz+Qd4F0xS2UBoEYsgedA at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > On 21 Jan 2014 17:28, "David Goldsmith" wrote: > > > > > > Am I the only one who feels that this (very important--I'm being sincere, > not sarcastic) thread has matured and specialized enough to warrant it's > own home on the Wiki? > > Sounds plausible, perhaps you could write up such a page? > > -n > I can certainly get one started (but I don't think I can faithfully summarize all this thread's current content, so I apologize in advance for leaving that undone). DG -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Jan 21 13:00:19 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 21 Jan 2014 10:00:19 -0800 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: A lot of good discussion here -- to much to comment individually, but it seems we can boil it down to a couple somewhat distinct proposals: 1) a one-byte-per-char dtype: This would provide compact, high efficiency storage for common text for scientific computing. It is analogous to a lower-precision numeric type -- i.e. it could not store any unicode strings -- only the subset that are compatible the suggested encoding. Suggested encoding: latin-1 Other options: - ascii only. - settable to any one-byte per char encoding supported by python I like this IFF it's pretty easy, but it may add significant complications (and overhead) for comparisons, etc.... 
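To make option (1) concrete, here's a rough sketch (not a worked-out design -- the helper names are invented for illustration) of how a latin-1 one-byte text type can be approximated today by keeping encoded bytes in an 'S' array and encoding/decoding at the boundaries:

import numpy as np

# sketch only: emulate a one-byte-per-char latin-1 text dtype on top of 'S'
def latin1_array(strings, width):
    # a UnicodeEncodeError here plays the role of the proposed dtype
    # rejecting characters it cannot represent
    return np.array([s.encode('latin-1') for s in strings], dtype='S%d' % width)

def latin1_tolist(arr):
    # decode on the way out, so users only ever see text
    return [b.decode('latin-1') for b in arr.tolist()]

a = latin1_array([u'hello', u'\xd5scar'], 10)
print(a.dtype)            # |S10 -- one byte per character
print(a.nbytes)           # 20 bytes; the same data as 'U10' would take 80
print(latin1_tolist(a))   # plain python (unicode) strings come back out

A real dtype would do the encode/decode under the hood and keep the result clearly labelled as text rather than bytes; this is just to show the intended semantics.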
NOTE: This is NOT a way to conflate bytes and text, and not a way to "go back to the py2 mojibake hell" -- the goal here is to very clearly have this be text data, and have a clearly defined encoding. Which is why we can't just use 'S' -- or adapt 'S' to do this. Rather is is a way to conveniently and efficiently use numpy for text that is ansi compatible. 2) a utf-8 dtype: NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte per char encoding, so would not snuggly into the numpy data model. It would give compact memory use for mostly-ascii data, so that would be nice. 3) a fully python-3 like ( PEP 393 ) flexible unicode dtype. This would get us the advantages of the new py3 unicode model -- compact and efficient when it can be, but also supporting all of unicode. Honestly, this seems like more work than it's worth to me, at least given the current numpy dtype model -- maybe a nice addition to dynd. YOu can, after all, simply use an object array with py3 strings in it. Though perhaps using the py3 unicode type, but having a dtype that specifically links to that, rather than a generic python object would be a good compromise. Hmm -- I guess despite what I said, I just write the starting pint for a NEP... (or two, actually...) -Chris On Tue, Jan 21, 2014 at 9:46 AM, Chris Barker wrote: > On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith wrote: > >> >> Am I the only one who feels that this (very important--I'm being sincere, >> not sarcastic) thread has matured and specialized enough to warrant it's >> own home on the Wiki? >> > > Or maybe a NEP? > > https://github.com/numpy/numpy/tree/master/doc/neps > > sorry -- really swamped this week, so I won't be writing it... > > -Chris > > > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jan 21 13:14:28 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 21 Jan 2014 11:14:28 -0700 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker wrote: > A lot of good discussion here -- to much to comment individually, but it > seems we can boil it down to a couple somewhat distinct proposals: > > 1) a one-byte-per-char dtype: > > This would provide compact, high efficiency storage for common text > for scientific computing. It is analogous to a lower-precision numeric type > -- i.e. it could not store any unicode strings -- only the subset that are > compatible the suggested encoding. > Suggested encoding: latin-1 > Other options: > - ascii only. > - settable to any one-byte per char encoding supported by python > I like this IFF it's pretty easy, but it may > add significant complications (and overhead) for comparisons, etc.... > > NOTE: This is NOT a way to conflate bytes and text, and not a way to "go > back to the py2 mojibake hell" -- the goal here is to very clearly have > this be text data, and have a clearly defined encoding. Which is why we > can't just use 'S' -- or adapt 'S' to do this. 
Rather is is a way > to conveniently and efficiently use numpy for text that is ansi compatible. > > 2) a utf-8 dtype: > NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte > per char encoding, so would not snuggly into the numpy data model. > It would give compact memory use for mostly-ascii data, so that would > be nice. > > 3) a fully python-3 like ( PEP 393 ) flexible unicode dtype. > This would get us the advantages of the new py3 unicode model -- compact > and efficient when it can be, but also supporting all of unicode. Honestly, > this seems like more work than it's worth to me, at least given the current > numpy dtype model -- maybe a nice addition to dynd. YOu can, after > all, simply use an object array with py3 strings in it. Though perhaps > using the py3 unicode type, but having a dtype that specifically links to > that, rather than a generic python object would be a good compromise. > > > Hmm -- I guess despite what I said, I just write the starting pint for a > NEP... > > Should also mention the reasons for adding a new data type. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Tue Jan 21 13:34:38 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Tue, 21 Jan 2014 10:34:38 -0800 Subject: [Numpy-discussion] A one-byte string dtype? Message-ID: On Tue, Jan 21, 2014 at 10:00 AM, wrote: > Date: Tue, 21 Jan 2014 09:53:25 -0800 > From: David Goldsmith > Subject: Re: [Numpy-discussion] A one-byte string dtype? > To: numpy-discussion at scipy.org > Message-ID: > 7aLTPXMrz4MiujY2XEbyi_fY5WWw at mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > > Date: Tue, 21 Jan 2014 17:35:26 +0000 > > From: Nathaniel Smith > > Subject: Re: [Numpy-discussion] A one-byte string dtype? > > To: Discussion of Numerical Python > > Message-ID: > > > KE3xLGa2+gz+Qd4F0xS2UBoEYsgedA at mail.gmail.com> > > Content-Type: text/plain; charset="utf-8" > > > > On 21 Jan 2014 17:28, "David Goldsmith" wrote: > > > > > > > > > Am I the only one who feels that this (very important--I'm being > sincere, > > not sarcastic) thread has matured and specialized enough to warrant it's > > own home on the Wiki? > > > > Sounds plausible, perhaps you could write up such a page? > > > > -n > > > > I can certainly get one started (but I don't think I can faithfully > summarize all this thread's current content, so I apologize in advance for > leaving that undone). > > DG > OK, I'm "lost" already: is there general agreement that this should "jump" straight to one or more NEP's? If not (or if there should be a Wiki page for it additionally), should such become part of the NumPy Wiki @ Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is one's SciPy Wiki login the same as one's mailing list subscriber maintenance login? I guess starting such a page is not as trivial as I had assumed. DG -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jan 21 14:20:12 2014 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 21 Jan 2014 19:20:12 +0000 Subject: [Numpy-discussion] A one-byte string dtype? In-Reply-To: References: Message-ID: On Tue, Jan 21, 2014 at 6:34 PM, David Goldsmith wrote: >> I can certainly get one started (but I don't think I can faithfully >> summarize all this thread's current content, so I apologize in advance for >> leaving that undone). 
>> >> DG > > OK, I'm "lost" already: is there general agreement that this should "jump" straight to one or more NEP's? If not (or if there should be a Wiki page for it additionally), should such become part of the NumPy Wiki @ Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is one's SciPy Wiki login the same as one's mailing list subscriber maintenance login? I guess starting such a page is not as trivial as I had assumed. The wiki is frozen. Please do not add anything to it. It plays no role in our current development workflow. Drafting a NEP or two and iterating on them would be the next step. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrew.collette at gmail.com Tue Jan 21 18:22:50 2014 From: andrew.collette at gmail.com (Andrew Collette) Date: Tue, 21 Jan 2014 16:22:50 -0700 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> Message-ID: Hi Chris, Just stumbled on this discussion (I'm the lead author of h5py). We would be overjoyed if there were a 1-byte text type available in NumPy. String handling is the source of major pain right now in the HDF5 world. All HDF5 strings are text (opaque types are used for binary data), but we're forced into using the "S" type most of the time because (1) the "U" type doesn't round-trip between HDF5 and NumPy, as there's no fixed-width wide-character string type in HDF5, and (2) "U" takes 4x the space, which is a problem for big scientific datasets. ASCII-only would be preferable, partly for selfish reasons (HDF5's default is ASCII only), and partly to make it possible to copy them into containers labelled "UTF-8" without manually inspecting every value. > """At the high-level interface, h5py exposes three kinds of strings. Each > maps to a specific type within Python (but see str_py3 below): > > Fixed-length ASCII (NumPy S type) > .... > """ > This is wrong, or mis-guided, or maybe only a little confusing -- 'S' is not > an ASCII string (even though I wish it were...). But clearly the HDF folsk > think we need one! Yes, this was intended to state that the HDF5 "Fixed-width ASCII" type maps to NumPy "S" at conversion time, which is obviously a wretched solution on Py3. >>>> dset = f.create_dataset("string_ds", (100,), dtype="S10") > """ > Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? Form > another post, I thought you'd need to use numpy.bytes_ (which is the same on > py2) It does produce an instance of 'numpy.bytes_', although I think the h5py docs should be changed to use bytes_ explicitly. Andrew From chris.barker at noaa.gov Tue Jan 21 19:30:23 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 21 Jan 2014 16:30:23 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> Message-ID: On Tue, Jan 21, 2014 at 3:22 PM, Andrew Collette wrote: > Just stumbled on this discussion (I'm the lead author of h5py). > > We would be overjoyed if there were a 1-byte text type available in > NumPy. cool -- it looks like someone is going to get a draft PEP going -- so stay tuned, and add you comments when there is something to add them too.. String handling is the source of major pain right now in the > HDF5 world. 
All HDF5 strings are text (opaque types are used for > binary data), but we're forced into using the "S" type most of the > time because (1) the "U" type doesn't round-trip between HDF5 and > NumPy, as there's no fixed-width wide-character string type in HDF5, > it looks from here: http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a lot of calls to encode/decode -- which could be pretty slow, compared to other ways to dump numpy arrays into HDF-5 -- that may be what you mean by "doesn't round trip". This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary encoding dtype, anyway). But: How does HDF handle the fact that utf-8 is not a fixed-length encoding? ASCII-only would be preferable, partly for selfish reasons (HDF5's > default is ASCII only), and partly to make it possible to copy them > into containers labelled "UTF-8" without manually inspecting every > value. > hmm -- ascii does have those advantages, but I'm not sure it's worth the restriction on what can be encoded. But you're quite right, you could dump ascii straight into something expecting utf-8, whereas you could not do that with latin-1, for instance. But you can't go the other way -- does it help much to avoid encoding in one direction? But maybe we can have an any-one-byte-per-char encoding option, in which case h5py could use ascii, but we wouldn't have to everywhere. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.l.goldsmith at gmail.com Tue Jan 21 19:58:30 2014 From: d.l.goldsmith at gmail.com (David Goldsmith) Date: Tue, 21 Jan 2014 16:58:30 -0800 Subject: [Numpy-discussion] A one-byte string dtype? Message-ID: Date: Tue, 21 Jan 2014 19:20:12 +0000 > From: Robert Kern > Subject: Re: [Numpy-discussion] A one-byte string dtype? > > The wiki is frozen. Please do not add anything to it. It plays no role in > our current development workflow. Drafting a NEP or two and iterating on > them would be the next step. > > -- > Robert Kern > OK, well that's definitely beyond my level of expertise.
-Chris > > DG > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From andrew.collette at gmail.com Tue Jan 21 20:54:33 2014 From: andrew.collette at gmail.com (Andrew Collette) Date: Tue, 21 Jan 2014 18:54:33 -0700 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D68161.7090807@googlemail.com> <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> Message-ID: Hi Chris, > it looks from here: > http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html > > that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a > lot of calls to encode/decode -- which could be pretty slow, compared to > other ways to dump numpy arrays into HDF-5 -- that may be waht you mean by > "doesn't round trip". HDF5 does have variable-length string support for UTF-8, so we map that directly to the unicode type (str on Py3) exactly as you describe, by encoding when we write to the file. But there's no way to round-trip with *fixed-width* strings. You can go from e.g. a 10 byte ASCII string to "U10", but going the other way fails if there are characters which take more than 1 byte to represent. We don't always get to choose the destination type, when e.g. writing into an existing dataset, so we can't always write vlen strings. > This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary > encoding dtype, anyway). > But: How does hdf handle the fact that utf-8 is not a fixed length encoding? With fixed-width strings it doesn't, really. If you use vlen strings it's fine, but otherwise there's just a fixed-width buffer labelled "UTF-8". Presumably you're supposed to be careful when writing not to chop the string off in the middle of a multibyte character. We could truncate strings on their way to the file, but the risk of data loss/corruption led us to simply not support it at all. > hmm -- ascii does have those advantages, but I'm not sure its worth the > restriction on what can be encoded. But you're quite right, you could dump > asciii straight into something expecting utf-8, whereas you could not do > that with latin-1, for instance. But you can't go the other way -- does it > help much to avoided encoding in one direction? It would help for h5py specifically because most HDF5 strings are labelled "ASCII". But it's a question for the community which is more important: the high-bit characters in latin-1, or write-compatibility with UTF-8. Andrew From fhaxbox66 at googlemail.com Wed Jan 22 01:58:27 2014 From: fhaxbox66 at googlemail.com (Dr. Leo) Date: Wed, 22 Jan 2014 07:58:27 +0100 Subject: [Numpy-discussion] fromiter cannot create array of object - was: Creating an ndarray from an iterable, over sequences In-Reply-To: References: Message-ID: <52DF6C13.8030400@gmail.com> Hi, thanks. Both recarray and itertools.chain work just fine in the example case. However, the real purpose of this is to read strings from a large xml file into a pandas DataFrame. But fromiter cannot create arrays of dtype 'object'. Fixed length strings may be worth trying. But as the xml schema does not guarantee a max. length, and pandas generally uses 'object' arrays for strings, I see no better way than creating the array through list comprehensions and turn it into a DataFrame. Maybe a variable length string/unicode type would help in the long term. 
Leo > > I would like to write something like: > > In [25]: iterable=((i, i**2) for i in range(10)) > > In [26]: a=np.fromiter(iterable, int32) > --------------------------------------------------------------------------- > ValueError Traceback (most recent call > last) > in () > ----> 1 a=np.fromiter(iterable, int32) > > ValueError: setting an array element with a sequence. > > > Is there an efficient way to do this? > Perhaps you could just utilize structured arrays ( http://docs.scipy.org/doc/numpy/user/basics.rec.html), like: iterable= ((i, i**2) for i in range(10)) a= np.fromiter(iterable, [('a', int32), ('b', int32)], 10) a.view(int32).reshape(-1, 2) You could use itertools: >>> from itertools import chain >>> g = ((i, i**2) for i in range(10)) >>> import numpy >>> numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2) From oscar.j.benjamin at gmail.com Wed Jan 22 05:46:49 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Wed, 22 Jan 2014 10:46:49 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> Message-ID: <20140122104646.GA2555@gmail.com> On Tue, Jan 21, 2014 at 06:54:33PM -0700, Andrew Collette wrote: > Hi Chris, > > > it looks from here: > > http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html > > > > that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a > > lot of calls to encode/decode -- which could be pretty slow, compared to > > other ways to dump numpy arrays into HDF-5 -- that may be waht you mean by > > "doesn't round trip". > > HDF5 does have variable-length string support for UTF-8, so we map > that directly to the unicode type (str on Py3) exactly as you > describe, by encoding when we write to the file. But there's no way > to round-trip with *fixed-width* strings. You can go from e.g. a 10 > byte ASCII string to "U10", but going the other way fails if there are > characters which take more than 1 byte to represent. We don't always > get to choose the destination type, when e.g. writing into an existing > dataset, so we can't always write vlen strings. Is it fair to say that people should really be using vlen utf-8 strings for text? Is it problematic because of the need to interface with non-Python libraries using the same hdf5 file? > > This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary > > encoding dtype, anyway). That's what I was thinking. A ragged utf-8 array could map to an array of vlen strings. Or am I misunderstanding how hdf5 works? Looking here: http://www.h5py.org/docs/topics/special.html ''' HDF5 supports a few types which have no direct NumPy equivalent. Among the most useful and widely used are variable-length (VL) types, and enumerated types. As of version 1.2, h5py fully supports HDF5 enums, and has partial support for VL types. ''' So that seems to suggests that h5py already has a use for a variable length string dtype. BTW, as much as the fixed-width 'S' dtype doesn't really work for str in Python 3 it's also a poor fit for bytes since it strips trailing nulls: >>> a = np.array(['a\0s\0', 'qwert'], dtype='S') >>> a array([b'a\x00s', b'qwert'], dtype='|S5') >>> a[0] b'a\x00s' > > But: How does hdf handle the fact that utf-8 is not a fixed length encoding? > > With fixed-width strings it doesn't, really. If you use vlen strings > it's fine, but otherwise there's just a fixed-width buffer labelled > "UTF-8". 
Presumably you're supposed to be careful when writing not to > chop the string off in the middle of a multibyte character. We could > truncate strings on their way to the file, but the risk of data > loss/corruption led us to simply not support it at all. Truncating utf-8 is never a good idea. Throwing an error message when it would truncate is okay though. Presumably you already do this when someone tries to assign an ASCII string that's too long right? Oscar From sebastian at sipsolutions.net Wed Jan 22 06:13:00 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 22 Jan 2014 12:13:00 +0100 Subject: [Numpy-discussion] fromiter cannot create array of object - was: Creating an ndarray from an iterable, over sequences In-Reply-To: <52DF6C13.8030400@gmail.com> References: <52DF6C13.8030400@gmail.com> Message-ID: <1390389180.31254.1.camel@sebastian-laptop> On Wed, 2014-01-22 at 07:58 +0100, Dr. Leo wrote: > Hi, > > thanks. Both recarray and itertools.chain work just fine in the example > case. > > However, the real purpose of this is to read strings from a large xml > file into a pandas DataFrame. But fromiter cannot create arrays of dtype > 'object'. Fixed length strings may be worth trying. But as the xml > schema does not guarantee a max. length, and pandas generally uses > 'object' arrays for strings, I see no better way than creating the array > through list comprehensions and turn it into a DataFrame. If your datatype is object, I doubt that using an intermediate list is a real overhead, since the list will use much less memory then the string objects anyway. - Sebastian > > Maybe a variable length string/unicode type would help in the long term. > > Leo > > > > > > I would like to write something like: > > > > In [25]: iterable=((i, i**2) for i in range(10)) > > > > In [26]: a=np.fromiter(iterable, int32) > > --------------------------------------------------------------------------- > > ValueError Traceback (most recent call > > last) > > in () > > ----> 1 a=np.fromiter(iterable, int32) > > > > ValueError: setting an array element with a sequence. > > > > > > Is there an efficient way to do this? > > > Perhaps you could just utilize structured arrays ( > http://docs.scipy.org/doc/numpy/user/basics.rec.html), like: > iterable= ((i, i**2) for i in range(10)) > a= np.fromiter(iterable, [('a', int32), ('b', int32)], 10) > a.view(int32).reshape(-1, 2) > > You could use itertools: > > >>> from itertools import chain > >>> g = ((i, i**2) for i in range(10)) > >>> import numpy > >>> numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2) > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From Ralf.Juengling at synopsys.com Wed Jan 22 12:23:35 2014 From: Ralf.Juengling at synopsys.com (Ralf Juengling) Date: Wed, 22 Jan 2014 17:23:35 +0000 Subject: [Numpy-discussion] accumulation operation Message-ID: <1150F30B0E0E7844ABF5FB64D0A8FA58240B5D17@US01WEMBX2.internal.synopsys.com> Executing the following code, >>> import numpy as np >>> a = np.zeros((3,)) >>> w = np.array([0, 1, 0, 1, 2]) >>> v = np.array([10.0, 1, 10.0, 2, 9]) >>> a[w] += v I was expecting 'a' to be array([20., 3., 9.]. Instead I get >>> a array([ 10., 2., 9.]) This with numpy version 1.6.1. Is there another way to do the accumulation I want? Thanks, Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sebastian at sipsolutions.net Wed Jan 22 12:32:15 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 22 Jan 2014 18:32:15 +0100 Subject: [Numpy-discussion] accumulation operation In-Reply-To: <1150F30B0E0E7844ABF5FB64D0A8FA58240B5D17@US01WEMBX2.internal.synopsys.com> References: <1150F30B0E0E7844ABF5FB64D0A8FA58240B5D17@US01WEMBX2.internal.synopsys.com> Message-ID: <1390411935.31254.4.camel@sebastian-laptop> On Wed, 2014-01-22 at 17:23 +0000, Ralf Juengling wrote: > Executing the following code, > > > > >>> import numpy as np > > >>> a = np.zeros((3,)) > > >>> w = np.array([0, 1, 0, 1, 2]) > > >>> v = np.array([10.0, 1, 10.0, 2, 9]) > > >>> a[w] += v > > > > I was expecting ?a? to be array([20., 3., 9.]. Instead I get > > > > >>> a > > array([ 10., 2., 9.]) > > > > This with numpy version 1.6.1. > > Is there another way to do the accumulation I want? > > Since you have addition, you should use np.bincount - Sebastian > > Thanks, > Ralf > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From jtaylor.debian at googlemail.com Wed Jan 22 12:32:35 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 22 Jan 2014 18:32:35 +0100 Subject: [Numpy-discussion] accumulation operation In-Reply-To: <1150F30B0E0E7844ABF5FB64D0A8FA58240B5D17@US01WEMBX2.internal.synopsys.com> References: <1150F30B0E0E7844ABF5FB64D0A8FA58240B5D17@US01WEMBX2.internal.synopsys.com> Message-ID: <52E000B3.5000900@googlemail.com> On 22.01.2014 18:23, Ralf Juengling wrote: > Executing the following code, > > > >>>> import numpy as np > >>>> a = np.zeros((3,)) > >>>> w = np.array([0, 1, 0, 1, 2]) > >>>> v = np.array([10.0, 1, 10.0, 2, 9]) > >>>> a[w] += v > > > > I was expecting ?a? to be array([20., 3., 9.]. Instead I get > > > >>>> a > > array([ 10., 2., 9.]) > > > > This with numpy version 1.6.1. > > Is there another way to do the accumulation I want? > you want: np.add.at(a, w, v) which is available in numpy 1.8 From andrew.collette at gmail.com Wed Jan 22 12:45:56 2014 From: andrew.collette at gmail.com (Andrew Collette) Date: Wed, 22 Jan 2014 10:45:56 -0700 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140122104646.GA2555@gmail.com> References: <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> Message-ID: Hi Oscar, > Is it fair to say that people should really be using vlen utf-8 strings for > text? Is it problematic because of the need to interface with non-Python > libraries using the same hdf5 file? The general recommendation has been to use fixed-width strings for exactly that reason; FORTRAN programs can't handle vlens, and older versions of IDL would refuse to deal with anything labelled utf-8, even fixed-width. >> > This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary >> > encoding dtype, anyway). > > That's what I was thinking. A ragged utf-8 array could map to an array of vlen > strings. Or am I misunderstanding how hdf5 works? Yes, that's exactly how HDF5 works for this; at the moment, we handle vlens with the NumPy object ("O") type storing regular Python strings. A native variable-length NumPy equivalent would also be appreciated, although I suspect it's a lot of work. > Truncating utf-8 is never a good idea. Throwing an error message when it would > truncate is okay though. 
Presumably you already do this when someone tries to > assign an ASCII string that's too long right? We advertise that HDF5 datasets work identically (as closely as practical) to NumPy arrays; in this case, NumPy truncates and doesn't warn, so we do the same. The concern with "U" is more that someone would write a "U10" string into a 10-byte HDF5 buffer and lose data, even though the advertised widths were the same. As an observation, a pure-ASCII NumPy type like the proposed "s" would avoid that completely. With a latin-1 type, it could still happen as certain characters would become 2 UTF-8 bytes. Andrew From chris.barker at noaa.gov Wed Jan 22 15:07:28 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 22 Jan 2014 12:07:28 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140122104646.GA2555@gmail.com> References: <20140116104303.GA11119@gmail.com> <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> Message-ID: On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin wrote: > BTW, as much as the fixed-width 'S' dtype doesn't really work for str in > Python 3 it's also a poor fit for bytes since it strips trailing nulls: > > >>> a = np.array(['a\0s\0', 'qwert'], dtype='S') > >>> a > array([b'a\x00s', b'qwert'], > dtype='|S5') > >>> a[0] > b'a\x00s' WHOOA! Good catch, Oscar. This conversation started with me suggesting that 'S' on py3 should mean "ascii string" (or latin-1 string). Then it was pointed out that it was already being used for arbitrary bytes, and thus could not be changed to mean a string without breaking already working code. However, if 'S' is assigning meaning to null bytes, and doing something with that, then it is, indeed being treated as an ANSI string (or the old c string "type", anyway). And any code that is expecting it to be arbitrary bytes is already broken, and in a way that could result in pretty subtle, hard to find bugs in the future. I think we really need a proper bytes dtype (which could be 'S' with the null byte thing removed), and a proper one-byte-per-character string type. Though I still don't know the use case for the fixed-length bytes type that can't be satisfied with the other numeric types, maybe: In [58]: bytes_15 = np.dtype(('B', 15)) though that doesn't in fact do what I expect: In [59]: arr = np.zeros((5,), dtype = bytes_15) In [60]: arr Out[60]: array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8) shouldn't I get a shape (5,) array, with each element a compound dtype with 15 bytes in it??? How would I spell that? By the way, from the docs for dtypes: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html """ The first character specifies the kind of data and the remaining characters specify how many bytes of data. The supported kinds are 'b' Boolean 'i' (signed) integer 'u' unsigned integer 'f' floating-point 'c' complex-floating point 'S', 'a', string 'U' unicode 'V' raw data (void) """ Could we use the 'a' for ascii string? (even though in now mapps directly to 'S') And by the way, the docs clearly say "string" there -- not bytes, so at the very least we need to update the docs... -Chris Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Wed Jan 22 16:13:32 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Wed, 22 Jan 2014 21:13:32 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> Message-ID: <20140122211328.GA1938@gmail.com> On Wed, Jan 22, 2014 at 12:07:28PM -0800, Chris Barker wrote: > On Wed, Jan 22, 2014 at 2:46 AM, Oscar Benjamin > wrote: > > > BTW, as much as the fixed-width 'S' dtype doesn't really work for str in > > Python 3 it's also a poor fit for bytes since it strips trailing nulls: > > > > >>> a = np.array(['a\0s\0', 'qwert'], dtype='S') > > >>> a > > array([b'a\x00s', b'qwert'], > > dtype='|S5') > > >>> a[0] > > b'a\x00s' > > > WHOOA! Good catch, Oscar. > > This conversation started with me suggesting that 'S' on py3 should mean > "ascii string" (or latin-1 string). > > Then it was pointed out that it was already being used for arbitrary bytes, > and thus could not be changed to mean a string without breaking already > working code. > > However, if 'S' is assigning meaning to null bytes, and doing something > with that, then it is, indeed being treated as an ANSI string (or the old c > string "type", anyway). And any code that is expecting it to be arbitrary > bytes is already broken, and in a way that could result in pretty subtle, > hard to find bugs in the future. > > I think we really need a proper bytes dtype (which could be 'S' with the > null byte thing removed), and a proper one-byte-per-character string type. It's not safe to stop removing the null bytes. This is how numpy determines the length of the strings in a dtype='S' array. The strings are not "fixed-width" but rather have a maximum width. Aything shorter gets padded with nulls. This is transparent if you index strings from the array: >>> a = np.array(b'a string of different length words'.split(), dtype='S') >>> a array([b'a', b'string', b'of', b'different', b'length', b'words'], dtype='|S9') >>> a[0] b'a' >>> len(a[0]) 1 >>> a.tostring() b'a\x00\x00\x00\x00\x00\x00\x00\x00string\x00\x00\x00of\x00\x00\x00\x00\x00\x00\x00differentlength\x00\x00\x00words\x00\x00\x00\x00'o If the trailing nulls are not removed then you would get: >>> a[0] b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00' >>> len(a[0]) 9 And I'm sure that someone would get upset about that. > Though I still don't know the use case for the fixed-length bytes type that > can't be satisfied with the other numeric types, Having the null bytes removed and a str (on Py2) object returned is precisely the use case that distinguishes it from np.uint8. The other differences are the removal of arithmetic operations. 
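To make that contrast concrete, here is a minimal sketch (an illustration only, assuming a contiguous array): the same buffer viewed as uint8 keeps every padding byte and gets arithmetic back, but loses the string-like scalar access that 'S' provides.

import numpy as np

a = np.array(b'a string of different length words'.split(), dtype='S')
raw = a.view(np.uint8).reshape(len(a), a.dtype.itemsize)

a[0]     # b'a' -- trailing padding nulls stripped, comes back as a bytes scalar
raw[0]   # array([97, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8) -- nothing stripped,
         # plain numbers, no text behaviour at all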
Some more oddities: >>> a[0] = 1 >>> a array([b'1', b'string', b'of', b'different', b'length', b'words'], dtype='|S9') >>> a[0] = None >>> a array([b'None', b'string', b'of', b'different', b'length', b'words'], dtype='|S9') >>> a[0] = range(1, 2) Traceback (most recent call last): File "", line 1, in ValueError: cannot set an array element with a sequence >>> a[0] = (x for x in range(2)) >>> a array([b' References: Message-ID: On Tue, Jan 21, 2014 at 5:46 PM, Charles R Harris wrote: > > > > On Tue, Jan 21, 2014 at 9:26 AM, jennifer stone wrote: > >> >> >What are your interests and experience? If you use numpy, are there >>> things >>> >you would like to fix, or enhancements you would like to see? >>> >>> Chuck >>> >>> >> I am an undergraduate student with CS as major and have interest in Math >> and Physics. This has led me to use NumPy and SciPy to work on innumerable >> cases involving special polynomial functions and polynomials like Legendre >> polynomials, Bessel Functions and so on. So, The packages are closer known >> to me from this point of view. I have a* few proposals* in mind. But I >> don't have any idea if they are acceptable within the scope of GSoC >> 1. Many special functions and polynomials are neither included in NumPy >> nor on SciPy.. These include Ellipsoidal Harmonic Functions (lames >> function), Cylindrical Harmonic function. Scipy at present supports only >> spherical Harmonic function. >> > > Further, why cant we extend SciPy to incorporate* Inverse Laplace >> Transforms*? At present Matlab has this amazing function *ilaplace* and >> SymPy does have *Inverse_Laplace_transform* but it would be better to >> incorporate all in one package. I mean SciPy does have function to evaluate >> laplace transform >> > Scipy doesn't have a function for the Laplace transform, it has only a Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage. An inverse Laplace transform would be very welcome I'd think - it has real world applications, and there's no good implementation in any open source library as far as I can tell. It's probably doable, but not the easiest topic for a GSoC I think. From what I can find, the paper "Numerical Transform Inversion Using Gaussian Quadrature" from den Iseger contains what's considered the current state of the art algorithm. Browsing that gives a reasonable idea of the difficulty of implementing `ilaplace`. > After having written this, I feel that this post should have been sent to >> SciPy >> but as a majority of contributors are the same I proceed. >> Please suggest any other possible projects, >> > You can have a look at https://github.com/scipy/scipy/pull/2908/files for ideas. Most of the things that need improving or we really think we should have in Scipy are listed there. Possible topics are not restricted to that list though - it's more important that you pick something you're interested in and have the required background and coding skills for. Cheers, Ralf as I would like to continue with SciPy or NumPy, preferably NumPy as I have >> been fiddling with its source code for a month now and so am pretty >> comfortable with it. >> >> As for my experience, I have known C for past 4 years and have been a >> python lover for past 1 year. I am pretty new to open source communities, >> started before a manth and a half. >> >> > It does sound like scipy might be a better match, I don't think anyone > would complain if you cross posted. 
Both scipy and numpy require GSOC > candidates to have a pull request accepted as part of the application > process. I'd suggest implementing a function not currently in scipy that > you think would be useful. That would also help in finding a mentor for the > summer. I'd also suggest getting familiar with cython. > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Jan 22 20:53:26 2014 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Wed, 22 Jan 2014 17:53:26 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140122211328.GA1938@gmail.com> References: <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> Message-ID: <7210200738621770529@unknownmsgid> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin wrote: > > It's not safe to stop removing the null bytes. This is how numpy determines > the length of the strings in a dtype='S' array. The strings are not > "fixed-width" but rather have a maximum width. Exactly--but folks have told us on this list that they want (and are) using the 'S' style for arbitrary bytes, NOT for text. In which case you wouldn't want to remove null bytes. This is more evidence that 'S' was designed to handle c-style one-byte-per-char strings, and NOT arbitrary bytes, and thus not to map directly to the py2 string type (you can store null bytes in a py2 string" Which brings me back to my original proposal: properly map the 'S' type to the py3 data model, and maybe add some kind of fixed width bytes style of there is a use case for that. I still have no idea what the use case might be. > If the trailing nulls are not removed then you would get: > >>>> a[0] > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00' >>>> len(a[0]) > 9 > > And I'm sure that someone would get upset about that. Only if they are using it for text-which you "should not" do with py3. > Having the null bytes removed and a str (on Py2) object returned is precisely > the use case that distinguishes it from np.uint8. But that was because it was designed to be used with text . And if you want text, then you should use py3 strings, not bytes. And if you really want bytes, then you wouldn't want null bytes removed. > The other differences are the > removal of arithmetic operations. And 'S' is treated as an atomic element, I'm not sure how you can do that cleanly with uint8. > Some more oddities: > >>>> a[0] = 1 >>>> a > array([b'1', b'string', b'of', b'different', b'length', b'words'], > dtype='|S9') >>>> a[0] = None >>>> a > array([b'None', b'string', b'of', b'different', b'length', b'words'], > dtype='|S9') More evidence that this is a text type..... 
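A short sketch of the same asymmetry on Python 3 (illustration only; the 'U' line shows the behaviour the py3 text model expects):

import numpy as np

s = 'a string'
np.array([s], dtype='S')[0] == s   # False: indexing returns bytes, which never compare equal to str
np.array([s], dtype='U')[0] == s   # True: 'U' round-trips Python 3 str cleanly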
-Chris From oscar.j.benjamin at gmail.com Thu Jan 23 05:45:22 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Thu, 23 Jan 2014 10:45:22 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <7210200738621770529@unknownmsgid> References: <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> Message-ID: <20140123104520.GA2300@gmail.com> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote: > On Jan 22, 2014, at 1:13 PM, Oscar Benjamin wrote: > > > > > It's not safe to stop removing the null bytes. This is how numpy determines > > the length of the strings in a dtype='S' array. The strings are not > > "fixed-width" but rather have a maximum width. > > Exactly--but folks have told us on this list that they want (and are) > using the 'S' style for arbitrary bytes, NOT for text. In which case > you wouldn't want to remove null bytes. This is more evidence that 'S' > was designed to handle c-style one-byte-per-char strings, and NOT > arbitrary bytes, and thus not to map directly to the py2 string type > (you can store null bytes in a py2 string" You can store null bytes in a Py2 string but you normally wouldn't if it was supposed to be text. > > Which brings me back to my original proposal: properly map the 'S' > type to the py3 data model, and maybe add some kind of fixed width > bytes style of there is a use case for that. I still have no idea what > the use case might be. > There would definitely be a use case for a fixed-byte-width bytes-representing-text dtype in record arrays to read from a binary file: dt = np.dtype([ ('name', '|b8:utf-8'), ('param1', ' > If the trailing nulls are not removed then you would get: > > > >>>> a[0] > > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00' > >>>> len(a[0]) > > 9 > > > > And I'm sure that someone would get upset about that. > > Only if they are using it for text-which you "should not" do with py3. But people definitely are using it for text on Python 3. It should be deprecated in favour of something new but breaking it is just gratuitous. Numpy doesn't have the option to make a clean break with Python 3 precisely because it needs to straddle 2.x and 3.x while numpy-based applications are ported to 3.x. > > Some more oddities: > > > >>>> a[0] = 1 > >>>> a > > array([b'1', b'string', b'of', b'different', b'length', b'words'], > > dtype='|S9') > >>>> a[0] = None > >>>> a > > array([b'None', b'string', b'of', b'different', b'length', b'words'], > > dtype='|S9') > > More evidence that this is a text type..... And the big one: $ python3 Python 3.2.3 (default, Sep 25 2013, 18:22:43) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> import numpy as np >>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings >>> a array([b'asd', b'zxc'], dtype='|S3') >>> a[0] = 'qwer' # Unicode string again >>> a array([b'qwe', b'zxc'], dtype='|S3') >>> a[0] = '?scar' Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) The analogous behaviour was very deliberately removed from Python 3: >>> a[0] == 'qwe' False >>> a[0] == b'qwe' True Oscar From njs at pobox.com Thu Jan 23 10:24:37 2014 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 23 Jan 2014 15:24:37 +0000 Subject: [Numpy-discussion] IRR Message-ID: Hey all, We have a PR languishing that fixes np.irr to handle negative rate-of-returns: https://github.com/numpy/numpy/pull/4210 I don't even know what "IRR" stands for, and it seems rather confusing from the discussion there. Anyone who knows something about the issues is invited to speak up... -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From josef.pktd at gmail.com Thu Jan 23 10:37:02 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 10:37:02 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140123104520.GA2300@gmail.com> References: <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin wrote: > On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote: >> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin wrote: >> >> > >> > It's not safe to stop removing the null bytes. This is how numpy determines >> > the length of the strings in a dtype='S' array. The strings are not >> > "fixed-width" but rather have a maximum width. >> >> Exactly--but folks have told us on this list that they want (and are) >> using the 'S' style for arbitrary bytes, NOT for text. In which case >> you wouldn't want to remove null bytes. This is more evidence that 'S' >> was designed to handle c-style one-byte-per-char strings, and NOT >> arbitrary bytes, and thus not to map directly to the py2 string type >> (you can store null bytes in a py2 string" > > You can store null bytes in a Py2 string but you normally wouldn't if it was > supposed to be text. > >> >> Which brings me back to my original proposal: properly map the 'S' >> type to the py3 data model, and maybe add some kind of fixed width >> bytes style of there is a use case for that. I still have no idea what >> the use case might be. >> > > There would definitely be a use case for a fixed-byte-width > bytes-representing-text dtype in record arrays to read from a binary file: > > dt = np.dtype([ > ('name', '|b8:utf-8'), > ('param1', ' ('param2', ' ... > ]) > > with open('binaryfile', 'rb') as fin: > a = np.fromfile(fin, dtype=dt) > > You could also use this for ASCII if desired. I don't think it really matters > that utf-8 uses variable width as long as a too long byte string throws an > error (and does not truncate). > > For non 8-bit encodings there would have to be some way to handle endianness > without a BOM, but otherwise I think that it's always possible to pad with zero > *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip > null *characters* after decoding. 
i.e.: > > $ cat tmp.py > import encodings > > def test_encoding(s1, enc): > b = s1.encode(enc).ljust(32, b'\0') > s2 = b.decode(enc) > index = s2.find('\0') > if index != -1: > s2 = s2[:index] > assert s1 == s2, enc > > encodings_set = set(encodings.aliases.aliases.values()) > > for N, enc in enumerate(encodings_set): > try: > test_encoding('qwe', enc) > except LookupError: > pass > > print('Tested %d encodings without error' % N) > $ python3 tmp.py > Tested 88 encodings without error > >> > If the trailing nulls are not removed then you would get: >> > >> >>>> a[0] >> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00' >> >>>> len(a[0]) >> > 9 >> > >> > And I'm sure that someone would get upset about that. >> >> Only if they are using it for text-which you "should not" do with py3. > > But people definitely are using it for text on Python 3. It should be > deprecated in favour of something new but breaking it is just gratuitous. > Numpy doesn't have the option to make a clean break with Python 3 precisely > because it needs to straddle 2.x and 3.x while numpy-based applications are > ported to 3.x. > >> > Some more oddities: >> > >> >>>> a[0] = 1 >> >>>> a >> > array([b'1', b'string', b'of', b'different', b'length', b'words'], >> > dtype='|S9') >> >>>> a[0] = None >> >>>> a >> > array([b'None', b'string', b'of', b'different', b'length', b'words'], >> > dtype='|S9') >> >> More evidence that this is a text type..... > > And the big one: > > $ python3 > Python 3.2.3 (default, Sep 25 2013, 18:22:43) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> import numpy as np >>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings >>>> a > array([b'asd', b'zxc'], > dtype='|S3') >>>> a[0] = 'qwer' # Unicode string again >>>> a > array([b'qwe', b'zxc'], > dtype='|S3') >>>> a[0] = '?scar' > Traceback (most recent call last): > File "", line 1, in > UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) > > The analogous behaviour was very deliberately removed from Python 3: > >>>> a[0] == 'qwe' > False >>>> a[0] == b'qwe' > True > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Thu Jan 23 10:41:30 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 10:41:30 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140123104520.GA2300@gmail.com> References: <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin wrote: > On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote: >> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin wrote: >> >> > >> > It's not safe to stop removing the null bytes. This is how numpy determines >> > the length of the strings in a dtype='S' array. The strings are not >> > "fixed-width" but rather have a maximum width. >> >> Exactly--but folks have told us on this list that they want (and are) >> using the 'S' style for arbitrary bytes, NOT for text. In which case >> you wouldn't want to remove null bytes. 
This is more evidence that 'S' >> was designed to handle c-style one-byte-per-char strings, and NOT >> arbitrary bytes, and thus not to map directly to the py2 string type >> (you can store null bytes in a py2 string" > > You can store null bytes in a Py2 string but you normally wouldn't if it was > supposed to be text. > >> >> Which brings me back to my original proposal: properly map the 'S' >> type to the py3 data model, and maybe add some kind of fixed width >> bytes style of there is a use case for that. I still have no idea what >> the use case might be. >> > > There would definitely be a use case for a fixed-byte-width > bytes-representing-text dtype in record arrays to read from a binary file: > > dt = np.dtype([ > ('name', '|b8:utf-8'), > ('param1', ' ('param2', ' ... > ]) > > with open('binaryfile', 'rb') as fin: > a = np.fromfile(fin, dtype=dt) > > You could also use this for ASCII if desired. I don't think it really matters > that utf-8 uses variable width as long as a too long byte string throws an > error (and does not truncate). > > For non 8-bit encodings there would have to be some way to handle endianness > without a BOM, but otherwise I think that it's always possible to pad with zero > *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip > null *characters* after decoding. i.e.: > > $ cat tmp.py > import encodings > > def test_encoding(s1, enc): > b = s1.encode(enc).ljust(32, b'\0') > s2 = b.decode(enc) > index = s2.find('\0') > if index != -1: > s2 = s2[:index] > assert s1 == s2, enc > > encodings_set = set(encodings.aliases.aliases.values()) > > for N, enc in enumerate(encodings_set): > try: > test_encoding('qwe', enc) > except LookupError: > pass > > print('Tested %d encodings without error' % N) > $ python3 tmp.py > Tested 88 encodings without error > >> > If the trailing nulls are not removed then you would get: >> > >> >>>> a[0] >> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00' >> >>>> len(a[0]) >> > 9 >> > >> > And I'm sure that someone would get upset about that. >> >> Only if they are using it for text-which you "should not" do with py3. > > But people definitely are using it for text on Python 3. It should be > deprecated in favour of something new but breaking it is just gratuitous. > Numpy doesn't have the option to make a clean break with Python 3 precisely > because it needs to straddle 2.x and 3.x while numpy-based applications are > ported to 3.x. > >> > Some more oddities: >> > >> >>>> a[0] = 1 >> >>>> a >> > array([b'1', b'string', b'of', b'different', b'length', b'words'], >> > dtype='|S9') >> >>>> a[0] = None >> >>>> a >> > array([b'None', b'string', b'of', b'different', b'length', b'words'], >> > dtype='|S9') >> >> More evidence that this is a text type..... > > And the big one: > > $ python3 > Python 3.2.3 (default, Sep 25 2013, 18:22:43) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> import numpy as np >>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings >>>> a > array([b'asd', b'zxc'], > dtype='|S3') >>>> a[0] = 'qwer' # Unicode string again >>>> a > array([b'qwe', b'zxc'], > dtype='|S3') >>>> a[0] = '?scar' > Traceback (most recent call last): > File "", line 1, in > UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) looks mostly like casting rules to me, which looks like ASCII based instead of an arbitrary encoding. 
>>> a = np.array(['asd', 'zxc'], dtype='S') >>> b = a.astype('U') >>> b[0] = '?scar' >>> a[0] = '?scar' Traceback (most recent call last): File "", line 1, in a[0] = '?scar' UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>> b array(['?sc', 'zxc'], dtype='>> b.astype('S') Traceback (most recent call last): File "", line 1, in b.astype('S') UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>> b.view('S4') array([b'\xd5', b's', b'c', b'z', b'x', b'c'], dtype='|S4') >>> a.astype('U').astype('S') array([b'asd', b'zxc'], dtype='|S3') Josef > > The analogous behaviour was very deliberately removed from Python 3: > >>>> a[0] == 'qwe' > False >>>> a[0] == b'qwe' > True > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Thu Jan 23 11:23:09 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 11:23:09 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <52D92DAE.5020409@witherden.org> <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 10:41 AM, wrote: > On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin > wrote: >> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote: >>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin wrote: >>> >>> > >>> > It's not safe to stop removing the null bytes. This is how numpy determines >>> > the length of the strings in a dtype='S' array. The strings are not >>> > "fixed-width" but rather have a maximum width. >>> >>> Exactly--but folks have told us on this list that they want (and are) >>> using the 'S' style for arbitrary bytes, NOT for text. In which case >>> you wouldn't want to remove null bytes. This is more evidence that 'S' >>> was designed to handle c-style one-byte-per-char strings, and NOT >>> arbitrary bytes, and thus not to map directly to the py2 string type >>> (you can store null bytes in a py2 string" >> >> You can store null bytes in a Py2 string but you normally wouldn't if it was >> supposed to be text. >> >>> >>> Which brings me back to my original proposal: properly map the 'S' >>> type to the py3 data model, and maybe add some kind of fixed width >>> bytes style of there is a use case for that. I still have no idea what >>> the use case might be. >>> >> >> There would definitely be a use case for a fixed-byte-width >> bytes-representing-text dtype in record arrays to read from a binary file: >> >> dt = np.dtype([ >> ('name', '|b8:utf-8'), >> ('param1', '> ('param2', '> ... >> ]) >> >> with open('binaryfile', 'rb') as fin: >> a = np.fromfile(fin, dtype=dt) >> >> You could also use this for ASCII if desired. I don't think it really matters >> that utf-8 uses variable width as long as a too long byte string throws an >> error (and does not truncate). >> >> For non 8-bit encodings there would have to be some way to handle endianness >> without a BOM, but otherwise I think that it's always possible to pad with zero >> *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip >> null *characters* after decoding. 
i.e.: >> >> $ cat tmp.py >> import encodings >> >> def test_encoding(s1, enc): >> b = s1.encode(enc).ljust(32, b'\0') >> s2 = b.decode(enc) >> index = s2.find('\0') >> if index != -1: >> s2 = s2[:index] >> assert s1 == s2, enc >> >> encodings_set = set(encodings.aliases.aliases.values()) >> >> for N, enc in enumerate(encodings_set): >> try: >> test_encoding('qwe', enc) >> except LookupError: >> pass >> >> print('Tested %d encodings without error' % N) >> $ python3 tmp.py >> Tested 88 encodings without error >> >>> > If the trailing nulls are not removed then you would get: >>> > >>> >>>> a[0] >>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00' >>> >>>> len(a[0]) >>> > 9 >>> > >>> > And I'm sure that someone would get upset about that. >>> >>> Only if they are using it for text-which you "should not" do with py3. >> >> But people definitely are using it for text on Python 3. It should be >> deprecated in favour of something new but breaking it is just gratuitous. >> Numpy doesn't have the option to make a clean break with Python 3 precisely >> because it needs to straddle 2.x and 3.x while numpy-based applications are >> ported to 3.x. >> >>> > Some more oddities: >>> > >>> >>>> a[0] = 1 >>> >>>> a >>> > array([b'1', b'string', b'of', b'different', b'length', b'words'], >>> > dtype='|S9') >>> >>>> a[0] = None >>> >>>> a >>> > array([b'None', b'string', b'of', b'different', b'length', b'words'], >>> > dtype='|S9') >>> >>> More evidence that this is a text type..... >> >> And the big one: >> >> $ python3 >> Python 3.2.3 (default, Sep 25 2013, 18:22:43) >> [GCC 4.6.3] on linux2 >> Type "help", "copyright", "credits" or "license" for more information. >>>>> import numpy as np >>>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings >>>>> a >> array([b'asd', b'zxc'], >> dtype='|S3') >>>>> a[0] = 'qwer' # Unicode string again >>>>> a >> array([b'qwe', b'zxc'], >> dtype='|S3') >>>>> a[0] = '?scar' >> Traceback (most recent call last): >> File "", line 1, in >> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) > > looks mostly like casting rules to me, which looks like ASCII based > instead of an arbitrary encoding. 
> >>>> a = np.array(['asd', 'zxc'], dtype='S') >>>> b = a.astype('U') >>>> b[0] = '?scar' >>>> a[0] = '?scar' > Traceback (most recent call last): > File "", line 1, in > a[0] = '?scar' > UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in > position 0: ordinal not in range(128) >>>> b > array(['?sc', 'zxc'], > dtype='>>> b.astype('S') > Traceback (most recent call last): > File "", line 1, in > b.astype('S') > UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in > position 0: ordinal not in range(128) >>>> b.view('S4') > array([b'\xd5', b's', b'c', b'z', b'x', b'c'], > dtype='|S4') > >>>> a.astype('U').astype('S') > array([b'asd', b'zxc'], > dtype='|S3') another curious example, encode utf-8 to latin-1 bytes >>> b array(['?sc', 'zxc'], dtype='>> b[0].encode('utf8') b'\xc3\x95sc' >>> b[0].encode('latin1') b'\xd5sc' >>> b.astype('S') Traceback (most recent call last): File "", line 1, in b.astype('S') UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>> c = b.view('S4').astype('S1').view('S3') >>> c array([b'\xd5sc', b'zxc'], dtype='|S3') >>> c[0].decode('latin1') '?sc' -------- The original numpy py3 conversion used latin-1 as default (It's still used in statsmodels, and I haven't looked at the structure under the common py2-3 codebase) if sys.version_info[0] >= 3: import io bytes = bytes unicode = str asunicode = str def asbytes(s): if isinstance(s, bytes): return s return s.encode('latin1') def asstr(s): if isinstance(s, str): return s return s.decode('latin1') -------------- Josef > > Josef > >> >> The analogous behaviour was very deliberately removed from Python 3: >> >>>>> a[0] == 'qwe' >> False >>>>> a[0] == b'qwe' >> True >> >> >> Oscar >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion From oscar.j.benjamin at gmail.com Thu Jan 23 11:43:09 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Thu, 23 Jan 2014 16:43:09 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> Message-ID: <20140123164305.GA5688@gmail.com> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote: > > another curious example, encode utf-8 to latin-1 bytes > > >>> b > array(['?sc', 'zxc'], > dtype=' >>> b[0].encode('utf8') > b'\xc3\x95sc' > >>> b[0].encode('latin1') > b'\xd5sc' > >>> b.astype('S') > Traceback (most recent call last): > File "", line 1, in > b.astype('S') > UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in > position 0: ordinal not in range(128) > >>> c = b.view('S4').astype('S1').view('S3') > >>> c > array([b'\xd5sc', b'zxc'], > dtype='|S3') > >>> c[0].decode('latin1') > '?sc' Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses ascii: >>> np.array(['?sc']).astype('S4') Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>> np.array(['?sc']).view('S4') array([b'\xd5', b's', b'c'], dtype='|S4') > -------- > The original numpy py3 conversion used latin-1 as default > (It's still used in statsmodels, and I haven't looked at the structure > under the common py2-3 codebase) > > if sys.version_info[0] >= 3: > import io > 
bytes = bytes > unicode = str > asunicode = str These two functions are an abomination: > def asbytes(s): > if isinstance(s, bytes): > return s > return s.encode('latin1') > def asstr(s): > if isinstance(s, str): > return s > return s.decode('latin1') Oscar From josef.pktd at gmail.com Thu Jan 23 11:58:38 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 11:58:38 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: <20140123164305.GA5688@gmail.com> References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin wrote: > On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote: >> >> another curious example, encode utf-8 to latin-1 bytes >> >> >>> b >> array(['?sc', 'zxc'], >> dtype='> >>> b[0].encode('utf8') >> b'\xc3\x95sc' >> >>> b[0].encode('latin1') >> b'\xd5sc' >> >>> b.astype('S') >> Traceback (most recent call last): >> File "", line 1, in >> b.astype('S') >> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in >> position 0: ordinal not in range(128) >> >>> c = b.view('S4').astype('S1').view('S3') >> >>> c >> array([b'\xd5sc', b'zxc'], >> dtype='|S3') >> >>> c[0].decode('latin1') >> '?sc' > > Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses > ascii: > >>>> np.array(['?sc']).astype('S4') > Traceback (most recent call last): > File "", line 1, in > UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>>> np.array(['?sc']).view('S4') > array([b'\xd5', b's', b'c'], > dtype='|S4') No, a view doesn't change the memory, it just changes the interpretation and there shouldn't be any conversion involved. astype does type conversion, but it goes through ascii encoding which fails. >>> b = np.array(['?sc', 'zxc'], dtype='>> b.tostring() b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>> b.view('S12') array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], dtype='|S12') The conversion happens somewhere in the array creation, but I have no idea about the memory encoding for uc2 and the low level layouts. 
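(As a sketch of that low level layout: 'U' stores one 32-bit code point per character in native byte order, so on a little-endian machine the raw buffer is exactly the UTF-32-LE encoding of the text.)

import numpy as np

b = np.array([chr(0xd5) + 'sc', 'zxc'], dtype='U3')
b.view(np.uint32)    # array([213, 115, 99, 122, 120, 99], dtype=uint32) -- the code points
b[:1].tostring() == b[0].encode('utf-32-le')    # True on a little-endian machine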
Josef > >> -------- >> The original numpy py3 conversion used latin-1 as default >> (It's still used in statsmodels, and I haven't looked at the structure >> under the common py2-3 codebase) >> >> if sys.version_info[0] >= 3: >> import io >> bytes = bytes >> unicode = str >> asunicode = str > > These two functions are an abomination: > >> def asbytes(s): >> if isinstance(s, bytes): >> return s >> return s.encode('latin1') >> def asstr(s): >> if isinstance(s, str): >> return s >> return s.decode('latin1') > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Thu Jan 23 12:13:55 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 12:13:55 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 11:58 AM, wrote: > On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin > wrote: >> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote: >>> >>> another curious example, encode utf-8 to latin-1 bytes >>> >>> >>> b >>> array(['?sc', 'zxc'], >>> dtype='>> >>> b[0].encode('utf8') >>> b'\xc3\x95sc' >>> >>> b[0].encode('latin1') >>> b'\xd5sc' >>> >>> b.astype('S') >>> Traceback (most recent call last): >>> File "", line 1, in >>> b.astype('S') >>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in >>> position 0: ordinal not in range(128) >>> >>> c = b.view('S4').astype('S1').view('S3') >>> >>> c >>> array([b'\xd5sc', b'zxc'], >>> dtype='|S3') >>> >>> c[0].decode('latin1') >>> '?sc' >> >> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses >> ascii: >> >>>>> np.array(['?sc']).astype('S4') >> Traceback (most recent call last): >> File "", line 1, in >> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>>>> np.array(['?sc']).view('S4') >> array([b'\xd5', b's', b'c'], >> dtype='|S4') > > > No, a view doesn't change the memory, it just changes the > interpretation and there shouldn't be any conversion involved. > astype does type conversion, but it goes through ascii encoding which fails. > >>>> b = np.array(['?sc', 'zxc'], dtype='>>> b.tostring() > b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>>> b.view('S12') > array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], > dtype='|S12') > > The conversion happens somewhere in the array creation, but I have no > idea about the memory encoding for uc2 and the low level layouts. 
utf8 encoded bytes >>> a = np.array(['?sc'.encode('utf8'), 'zxc'], dtype='S') >>> a array([b'\xc3\x95sc', b'zxc'], dtype='|S4') >>> a.tostring() b'\xc3\x95sczxc\x00' >>> a.view('S8') array([b'\xc3\x95sczxc'], dtype='|S8') >>> a[0].decode('latin1') '?\x95sc' >>> a[0].decode('utf8') '?sc' Josef > > Josef > >> >>> -------- >>> The original numpy py3 conversion used latin-1 as default >>> (It's still used in statsmodels, and I haven't looked at the structure >>> under the common py2-3 codebase) >>> >>> if sys.version_info[0] >= 3: >>> import io >>> bytes = bytes >>> unicode = str >>> asunicode = str >> >> These two functions are an abomination: >> >>> def asbytes(s): >>> if isinstance(s, bytes): >>> return s >>> return s.encode('latin1') >>> def asstr(s): >>> if isinstance(s, str): >>> return s >>> return s.decode('latin1') >> >> >> Oscar >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Thu Jan 23 12:42:13 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 12:42:13 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 12:13 PM, wrote: > On Thu, Jan 23, 2014 at 11:58 AM, wrote: >> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin >> wrote: >>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote: >>>> >>>> another curious example, encode utf-8 to latin-1 bytes >>>> >>>> >>> b >>>> array(['?sc', 'zxc'], >>>> dtype='>>> >>> b[0].encode('utf8') >>>> b'\xc3\x95sc' >>>> >>> b[0].encode('latin1') >>>> b'\xd5sc' >>>> >>> b.astype('S') >>>> Traceback (most recent call last): >>>> File "", line 1, in >>>> b.astype('S') >>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in >>>> position 0: ordinal not in range(128) >>>> >>> c = b.view('S4').astype('S1').view('S3') >>>> >>> c >>>> array([b'\xd5sc', b'zxc'], >>>> dtype='|S3') >>>> >>> c[0].decode('latin1') >>>> '?sc' >>> >>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses >>> ascii: >>> >>>>>> np.array(['?sc']).astype('S4') >>> Traceback (most recent call last): >>> File "", line 1, in >>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128) >>>>>> np.array(['?sc']).view('S4') >>> array([b'\xd5', b's', b'c'], >>> dtype='|S4') >> >> >> No, a view doesn't change the memory, it just changes the >> interpretation and there shouldn't be any conversion involved. >> astype does type conversion, but it goes through ascii encoding which fails. >> >>>>> b = np.array(['?sc', 'zxc'], dtype='>>>> b.tostring() >> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>>>> b.view('S12') >> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], >> dtype='|S12') >> >> The conversion happens somewhere in the array creation, but I have no >> idea about the memory encoding for uc2 and the low level layouts. >>> b = np.array(['?sc', 'zxc'], dtype='>> b[0].tostring() b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00' >>> '?sc'.encode('utf-32LE') b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00' Is that the encoding for 'U' ? 
--- another sideeffect of null truncation: cannot decode truncated data >>> b.view('S4').tostring() b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>> b.view('S4')[0] b'\xd5' >>> b.view('S4')[0].tostring() b'\xd5' >>> b.view('S4')[:1].tostring() b'\xd5\x00\x00\x00' >>> b.view('S4')[0].decode('utf-32LE') Traceback (most recent call last): File "", line 1, in b.view('S4')[0].decode('utf-32LE') File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode return codecs.utf_32_le_decode(input, errors, True) UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position 0: truncated data >>> b.view('S4')[:1].tostring().decode('utf-32LE') '?' numpy arrays need a decode and encode method Josef > > utf8 encoded bytes > >>>> a = np.array(['?sc'.encode('utf8'), 'zxc'], dtype='S') >>>> a > array([b'\xc3\x95sc', b'zxc'], > dtype='|S4') >>>> a.tostring() > b'\xc3\x95sczxc\x00' >>>> a.view('S8') > array([b'\xc3\x95sczxc'], > dtype='|S8') > >>>> a[0].decode('latin1') > '?\x95sc' >>>> a[0].decode('utf8') > '?sc' > > Josef > >> >> Josef >> >>> >>>> -------- >>>> The original numpy py3 conversion used latin-1 as default >>>> (It's still used in statsmodels, and I haven't looked at the structure >>>> under the common py2-3 codebase) >>>> >>>> if sys.version_info[0] >= 3: >>>> import io >>>> bytes = bytes >>>> unicode = str >>>> asunicode = str >>> >>> These two functions are an abomination: >>> >>>> def asbytes(s): >>>> if isinstance(s, bytes): >>>> return s >>>> return s.encode('latin1') >>>> def asstr(s): >>>> if isinstance(s, str): >>>> return s >>>> return s.decode('latin1') >>> >>> >>> Oscar >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at scipy.org >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion From josef.pktd at gmail.com Thu Jan 23 12:59:23 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 12:59:23 -0500 Subject: [Numpy-discussion] cannot decode 'S' Message-ID: truncating null bytes in 'S' breaks decoding that needs them >>> a = np.array([si.encode('utf-16LE') for si in ['?sc', 'zxc']], dtype='S') >>> a array([b'\xd5\x00s\x00c', b'z\x00x\x00c'], dtype='|S6') >>> [ai.decode('utf-16LE') for ai in a] Traceback (most recent call last): File "", line 1, in [ai.decode('utf-16LE') for ai in a] File "", line 1, in [ai.decode('utf-16LE') for ai in a] File "C:\Programs\Python33\lib\encodings\utf_16_le.py", line 16, in decode return codecs.utf_16_le_decode(input, errors, True) UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 4: truncated data messy workaround (arrays in contrast to scalars are not truncated in `tostring`) >>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))] ['?sc', 'zxc'] Found while playing with examples in the other thread. 
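A slightly tidier variant of that workaround, as a sketch (assuming a contiguous 1-d array whose elements were all encoded with the same codec): take the full buffer once, split it on the itemsize, and strip the padding after decoding rather than before.

import numpy as np

a = np.array([s.encode('utf-16-le') for s in ('abc', 'zxc')], dtype='S')
width = a.dtype.itemsize
buf = a.tostring()    # the whole buffer, padding nulls included
[buf[i:i + width].decode('utf-16-le').rstrip('\0')
 for i in range(0, len(buf), width)]    # ['abc', 'zxc']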
Josef From oscar.j.benjamin at gmail.com Thu Jan 23 13:36:57 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Thu, 23 Jan 2014 18:36:57 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On 23 January 2014 17:42, wrote: > On Thu, Jan 23, 2014 at 12:13 PM, wrote: >> On Thu, Jan 23, 2014 at 11:58 AM, wrote: >>> >>> No, a view doesn't change the memory, it just changes the >>> interpretation and there shouldn't be any conversion involved. >>> astype does type conversion, but it goes through ascii encoding which fails. >>> >>>>>> b = np.array(['?sc', 'zxc'], dtype='>>>>> b.tostring() >>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>>>>> b.view('S12') >>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], >>> dtype='|S12') >>> >>> The conversion happens somewhere in the array creation, but I have no >>> idea about the memory encoding for uc2 and the low level layouts. > >>>> b = np.array(['?sc', 'zxc'], dtype='>>> b[0].tostring() > b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00' >>>> '?sc'.encode('utf-32LE') > b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00' > > Is that the encoding for 'U' ? On a little-endian system, yes. I realise what' happening now. 'U' represents unicode characters as a 32-bit unsigned integer giving the code point of the character. The first 256 code points are exactly the 256 characters representable with latin-1 in the same order. So '?' has the code point 0xd5 and is encoded as the byte 0xd5 in latin-1. As a 32 bit integer the code point is 0x000000d5 but in little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So when you reinterpret that as 'S4' it strips the remaining nulls to get the byte string b'\xd5'. Which is the latin-1 encoding for the character. The same will happen for any string of latin-1 characters. However if you do have a code point of 256 or greater then you'll get a byte strings of length 2 or more. On a big-endian system I think you'd get b'\x00\x00\x00\xd5'. > another sideeffect of null truncation: cannot decode truncated data > >>>> b.view('S4').tostring() > b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>>> b.view('S4')[0] > b'\xd5' >>>> b.view('S4')[0].tostring() > b'\xd5' >>>> b.view('S4')[:1].tostring() > b'\xd5\x00\x00\x00' > >>>> b.view('S4')[0].decode('utf-32LE') > Traceback (most recent call last): > File "", line 1, in > b.view('S4')[0].decode('utf-32LE') > File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode > return codecs.utf_32_le_decode(input, errors, True) > UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position > 0: truncated data > >>>> b.view('S4')[:1].tostring().decode('utf-32LE') > '?' > > numpy arrays need a decode and encode method I'm not sure that they do. Rather there needs to be a text dtype that knows what encoding to use in order to have a binary interface as exposed by .tostring() and friends and but produce unicode strings when indexed from Python code. Having both a text and a binary interface to the same data implies having an encoding. 
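One rough sketch of that pairing at the Python level, with hypothetical helper names (encode_fixed / decode_fixed are not numpy functions) and utf-8 assumed; it leans on the fact that utf-8 never produces interior null bytes, so the trailing-null stripping of 'S' only ever removes padding:

import numpy as np

def encode_fixed(strings, width, encoding='utf-8'):
    # text in, fixed-width bytes out; refuse to truncate
    data = [s.encode(encoding) for s in strings]
    if max(len(b) for b in data) > width:
        raise ValueError('encoded string longer than %d bytes' % width)
    return np.array(data, dtype='S%d' % width)

def decode_fixed(arr, encoding='utf-8'):
    # bytes in, unicode strings out; safe only for encodings with no
    # interior null bytes (utf-8, latin-1), unlike utf-16/utf-32
    return np.array([x.decode(encoding) for x in arr], dtype='U')

decode_fixed(encode_fixed(['qwe', chr(0xd5) + 'sc'], 8))   # back to a 'U' array of the original text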
Oscar From chris.barker at noaa.gov Thu Jan 23 13:49:42 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 23 Jan 2014 10:49:42 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: Thanks for poking into this all. I've lost track a bit, but I think: The 'S' type is clearly broken on py3 (at least). I think that gives us room to change it, and backward compatibly is less of an issue because it's broken already -- do we need to preserve bug-for-bug compatibility? Maybe, but I suspect in this case, not -- the code the "works fine" on py3 with the 'S' type is probably only lucky that it hasn't encountered the issues yet. And no matter how you slice it, code being ported to py3 needs to deal with text handling issues. But here is where we stand: The 'S' dtype: - was designed for one-byte-per-char text data. - was mapped to the py2 string type. - used the classic C null-terminated approach. - can be used for arbitrary bytes (as the py2 string type can), but not quite, as it truncates null bytes -- so it really a bad idea to use it that way. Under py3: The 'S' type maps to the py3 bytes type, because that's the closest to the py2 string type. But it also does some inconsistent things with encoding, and does treat a lot of other things as text. But the py3 bytes type does not have the same text handling as the py2 string type, so things like: s = 'a string' np.array((s,), dtype='S')[0] == s Gives you False, rather than True on py2. This is because a py3 string is translated to the 'S' type (presumable with the default encoding, another maybe not a good idea, but returns a bytes object, which does not compare true to a py3 string. YOu can work aroudn this with varios calls to encode() and decode, and/or using b'a string', but that is ugly, kludgy, and doesn't work well with the py3 text model. The py2 => py3 transition separated bytes and strings: strings are unicode, and bytes are not to be used for text (directly). While there is some text-related functionality still in bytes, the core devs are quite clear that that is for special cases only, and not for general text processing. I don't think numpy should fight this, but rather embrace the py3 text model. The most natural way to do that is to use the existing 'U' dtype for text. Really the best solution for most cases. (Like the above case) However, there is a use case for a more efficient way to deal with text. There are a couple ways to go about that that have been brought up here: 1: have a more efficient unicode dtype: variable length, multiple encoding options, etc.... - This is a fine idea that would support better text handling in numpy, and _maybe_ better interaction with external libraries (HDF, etc...) 2: Have a one-byte-per-char text dtype: - This would be much easier to implement fit into the current numpy model, and satisfy a lot of common use cases for scientific data sets. We could certainly do both, but I'd like to see (2) get done sooner than later.... A related issue is whether numpy needs a dtype analogous to py3 bytes -- I'm still not sure of the use-case there, so can't comment -- would it need to be fixed length (fitting into the numpy data model better) or variable length, or ??? 
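For what it's worth, the closest existing spelling of "a fixed 15-byte raw element" seems to be a uint8 subarray inside a structured dtype, sketched below; unlike the bare np.dtype(('B', 15)) tried earlier, which broadcasts the 15 into an extra array dimension, this keeps a shape (5,) array with one 15-byte record per element.

import numpy as np

rec = np.dtype([('raw', np.uint8, 15)])
arr = np.zeros(5, dtype=rec)
arr.shape        # (5,)
arr[0]['raw']    # fifteen uint8 values: no null stripping, no text semantics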
Some folks are (apparently) using the current 'S' type in this way, but I think that's ripe for errors, due to the null bytes issue. Though maybe there is a null-bytes-are-special binary format that isn't text -- I have no idea. So what do we do with 'S'? It really is pretty broken, so we have a couple choices: (1) depricate it, so that it stays around for backward compatibility but encourage people to either use 'U' for text, or one of the new dtypes that are yet to be implemented (maybe 's' for a one-byte-per-char dtype), and use either uint8 or the new bytes dtype that is yet to be implemented. (2) fix it -- in this case, I think we need to be clear what it is: -- A one-byte-char-text type? If so, it should map to a py3 string, and have a defined encoding (ascii or latin-1, probably), or even better a settable encoding (but only for one-byte-per-char encodings -- I don't think utf-8 is a good idea here, as a utf-8 encoded string is of unknown length. (there is some room for debate here, as the 'S' type is fixed length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as long as it doesn't partially truncate in teh middle of a charactor) -- a bytes type? in which case, we should clean out all teh automatic conversion to-from text that iare in it now. I vote for it being our one-byte text type -- it almost is already, and it would make the easiest transition for folks from py2 to py3. But backward compatibility is backward compatibility. > numpy arrays need a decode and encode method I'm not sure that they do. Rather there needs to be a text dtype that > knows what encoding to use in order to have a binary interface as > exposed by .tostring() and friends and but produce unicode strings > when indexed from Python code. Having both a text and a binary > interface to the same data implies having an encoding. I agree with Oscar here -- let's not conflate encode and decoded data -- the py3 text model is a fine one, we should work with it as much as practical. UNLESS: if we do add a bytes dtype, then it would be a reasonable use case to use it to store encoded text (just like the py3 bytes types), in which case it would be good to have encode() and decode() methods or ufuncs -- probably ufuncs. But that should be for special purpose, at the I/O interface kind of stuff. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Thu Jan 23 14:18:20 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 14:18:20 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 1:49 PM, Chris Barker wrote: > > s = 'a string' > np.array((s,), dtype='S')[0] == s > > Gives you False, rather than True on py2. This is because a py3 string is > translated to the 'S' type (presumable with the default encoding, another > maybe not a good idea, but returns a bytes object, which does not compare > true to a py3 string. 
YOu can work aroudn this with varios calls to encode() > and decode, and/or using b'a string', but that is ugly, kludgy, and doesn't > work well with the py3 text model. I think this is just inconsistent casting rules in numpy, numpy should either refuse to assign the wrong type, instead of using the repr as in some of the earlier examples of Oscar >>> s = np.inf >>> np.array((s,), dtype=int)[0] == s Traceback (most recent call last): File "", line 1, in np.array((s,), dtype=int)[0] == s OverflowError: cannot convert float infinity to integer or use the **same** conversion/casting rules also during the interaction with python as are used in assignments and array creation. Josef From chris.barker at noaa.gov Thu Jan 23 14:40:43 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 23 Jan 2014 11:40:43 -0800 Subject: [Numpy-discussion] cannot decode 'S' In-Reply-To: References: Message-ID: Josef, Nice find -- another reason why 'S' can NOT be used a-is for arbitrary bytes. See the other thread for my proposals about that. > messy workaround (arrays in contrast to scalars are not truncated in > `tostring`) > > >>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))] > ['?sc', 'zxc'] > > I think the real "work around" is to not try to store arbitrary bytes -- i.e. encoded text, in the 'S' dtype. But is there a convenient way to do it with other existing numpy types? I tried to do it with uint8, and it's really awkward.... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Jan 23 14:45:40 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 23 Jan 2014 11:45:40 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 11:18 AM, wrote: > I think this is just inconsistent casting rules in numpy, > > numpy should either refuse to assign the wrong type, instead of using > the repr as in some of the earlier examples of Oscar > > >>> s = np.inf > >>> np.array((s,), dtype=int)[0] == s > Traceback (most recent call last): > File "", line 1, in > np.array((s,), dtype=int)[0] == s > OverflowError: cannot convert float infinity to integer > > or use the **same** conversion/casting rules also during the > interaction with python as are used in assignments and array creation. > Exactly -- but what should those conversion/casting rules be? We can't decide that unless we decide if 'S' is for text or for arbitrary bytes -- it can't be both. I say text, that's what it's mostly trying to do already. But if it's bytes, fine, then some things still need cleaning up, and we could really use a one-byte-text type. and if it's text, then we may need a bytes dtype. Key here is that we don't have the option of not breaking anything, because there is a lot already broken. -Chris -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Thu Jan 23 14:49:17 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 14:49:17 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: >> > numpy arrays need a decode and encode method > > >> I'm not sure that they do. Rather there needs to be a text dtype that >> knows what encoding to use in order to have a binary interface as >> exposed by .tostring() and friends and but produce unicode strings >> when indexed from Python code. Having both a text and a binary >> interface to the same data implies having an encoding. > > > I agree with Oscar here -- let's not conflate encode and decoded data -- > the py3 text model is a fine one, we should work with it as much as > practical. > > UNLESS: if we do add a bytes dtype, then it would be a reasonable use case > to use it to store encoded text (just like the py3 bytes types), in which > case it would be good to have encode() and decode() methods or ufuncs -- > probably ufuncs. But that should be for special purpose, at the I/O > interface kind of stuff. > I think we need both things changing the memory and changing the view. The same way we can convert between int and float and complex (trunc, astype, real, ...) we should be able to convert between bytes and any string (text) dtypes, i.e. decode and encode. I'm reading a file in binary and then want to convert it to unicode, only I realize I have only ascii and want to convert to something less memory hungry. views don't care about what the content means, it just has to be memory compatible, I can view anything as an 'S' or a 'uint' (I think). What we currently don't have is a string/text view on S that would interact with python as string. (that's a vote in favor of a minimal one char string dtype that would work for a limited number of encodings.) Josef From josef.pktd at gmail.com Thu Jan 23 15:10:34 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 15:10:34 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 2:45 PM, Chris Barker wrote: > On Thu, Jan 23, 2014 at 11:18 AM, wrote: > >> >> I think this is just inconsistent casting rules in numpy, >> >> numpy should either refuse to assign the wrong type, instead of using >> the repr as in some of the earlier examples of Oscar >> >> >>> s = np.inf >> >>> np.array((s,), dtype=int)[0] == s >> Traceback (most recent call last): >> File "", line 1, in >> np.array((s,), dtype=int)[0] == s >> OverflowError: cannot convert float infinity to integer >> >> or use the **same** conversion/casting rules also during the >> interaction with python as are used in assignments and array creation. > > > Exactly -- but what should those conversion/casting rules be? 
We can't > decide that unless we decide if 'S' is for text or for arbitrary bytes -- it > can't be both. I say text, that's what it's mostly trying to do already. But > if it's bytes, fine, then some things still need cleaning up, and we could > really use a one-byte-text type. and if it's text, then we may need a bytes > dtype. (remember I'm just a balcony muppet) As far as I understand all codecs have the same ascii part. So I would cast on ascii and raise on anything else. or follow whatever the convention of numpy is: >>> s = -256 >>> np.array((s,), dtype=np.uint8)[0] == s False >>> s = -1 >>> np.array((s,), dtype=np.uint8)[0] == s False Josef > > Key here is that we don't have the option of not breaking anything, because > there is a lot already broken. > > -Chris > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From josef.pktd at gmail.com Thu Jan 23 15:18:18 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 15:18:18 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 1:36 PM, Oscar Benjamin wrote: > On 23 January 2014 17:42, wrote: >> On Thu, Jan 23, 2014 at 12:13 PM, wrote: >>> On Thu, Jan 23, 2014 at 11:58 AM, wrote: >>>> >>>> No, a view doesn't change the memory, it just changes the >>>> interpretation and there shouldn't be any conversion involved. >>>> astype does type conversion, but it goes through ascii encoding which fails. >>>> >>>>>>> b = np.array(['?sc', 'zxc'], dtype='>>>>>> b.tostring() >>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>>>>>> b.view('S12') >>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], >>>> dtype='|S12') >>>> >>>> The conversion happens somewhere in the array creation, but I have no >>>> idea about the memory encoding for uc2 and the low level layouts. >> >>>>> b = np.array(['?sc', 'zxc'], dtype='>>>> b[0].tostring() >> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00' >>>>> '?sc'.encode('utf-32LE') >> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00' >> >> Is that the encoding for 'U' ? > > On a little-endian system, yes. I realise what' happening now. 'U' > represents unicode characters as a 32-bit unsigned integer giving the > code point of the character. The first 256 code points are exactly the > 256 characters representable with latin-1 in the same order. > > So '?' has the code point 0xd5 and is encoded as the byte 0xd5 in > latin-1. As a 32 bit integer the code point is 0x000000d5 but in > little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So > when you reinterpret that as 'S4' it strips the remaining nulls to get > the byte string b'\xd5'. Which is the latin-1 encoding for the > character. The same will happen for any string of latin-1 characters. > However if you do have a code point of 256 or greater then you'll get > a byte strings of length 2 or more. 
> > On a big-endian system I think you'd get b'\x00\x00\x00\xd5'. I curious consequence of this, if we have only 1 character elements: >>> a = np.array([si.encode('utf-16LE') for si in ['?', 'z']], dtype='S') >>> a32 = np.array([si.encode('utf-32LE') for si in ['?', 'z']], dtype='S') >>> a[0], a32[0] (b'\xd5', b'\xd5') >>> a[0] == a32[0] True >>> a32 = np.array([si.encode('utf-32BE') for si in ['?', 'z']], dtype='S') >>> a = np.array([si.encode('utf-16BE') for si in ['?', 'z']], dtype='S') >>> a[0], a32[0] (b'\x00\xd5', b'\x00\x00\x00\xd5') >>> a[0] == a32[0] False Josef > >> another sideeffect of null truncation: cannot decode truncated data >> >>>>> b.view('S4').tostring() >> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00' >>>>> b.view('S4')[0] >> b'\xd5' >>>>> b.view('S4')[0].tostring() >> b'\xd5' >>>>> b.view('S4')[:1].tostring() >> b'\xd5\x00\x00\x00' >> >>>>> b.view('S4')[0].decode('utf-32LE') >> Traceback (most recent call last): >> File "", line 1, in >> b.view('S4')[0].decode('utf-32LE') >> File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode >> return codecs.utf_32_le_decode(input, errors, True) >> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position >> 0: truncated data >> >>>>> b.view('S4')[:1].tostring().decode('utf-32LE') >> '?' >> >> numpy arrays need a decode and encode method > > I'm not sure that they do. Rather there needs to be a text dtype that > knows what encoding to use in order to have a binary interface as > exposed by .tostring() and friends and but produce unicode strings > when indexed from Python code. Having both a text and a binary > interface to the same data implies having an encoding. > > > Oscar > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From oscar.j.benjamin at gmail.com Thu Jan 23 15:34:53 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Thu, 23 Jan 2014 20:34:53 +0000 Subject: [Numpy-discussion] Text array dtype for numpy Message-ID: There have been a few threads discussing the problems of how to do text with numpy arrays in Python 3. To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3. It is intended to be an illustration since I think that the real solution is a new dtype rather than an array subclass (so that it can be used in e.g. record arrays). The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings. I believe it would not be as hard to implement as the proposals for variable length string arrays. The one caveat is that it will strip null characters from the end of any string. I'm not 100% that the byte stripping encoding function will always work but it will for all the encodings I know and it seems to work with all the encodings that Python has. The code is inline below and attached (in case there are encoding problems with this message!): Oscar #!/usr/bin/env python3 from numpy import ndarray, array class textarray(ndarray): '''ndarray for holding encoded text. This is for demonstration purposes only. The real proposal is to specify the encoding as a dtype rather than a subclass. Only works as a 1-d array. 
>>> a = textarray(['qwert', 'zxcvb'], encoding='ascii') >>> a textarray(['qwert', 'zxcvb'], dtype='|S5:ascii') >>> a[0] 'qwert' >>> a.tostring() b'qwertzxcvb' >>> a[0] = 'qwe' # shorter string >>> a[0] 'qwe' >>> a.tostring() b'qwe\\x00\\x00zxcvb' >>> a[0] = 'qwertyuiop' # longer string Traceback (most recent call last): ... ValueError: Encoded bytes don't fit >>> b = textarray(['?scar', 'qwe'], encoding='utf-8') >>> b textarray(['?scar', 'qwe'], dtype='|S6:utf-8') >>> b[0] '?scar' >>> b[0].encode('utf-8') b'\\xc3\\x95scar' >>> b.tostring() b'\\xc3\\x95scarqwe\\x00\\x00\\x00' >>> c = textarray(['qwe'], encoding='utf-32-le') >>> c textarray(['qwe'], dtype='|S12:utf-32-le') ''' def __new__(cls, strings, encoding='utf-8'): bytestrings = [s.encode(encoding) for s in strings] a = array(bytestrings, dtype='S').view(textarray) a.encoding = encoding return a def __repr__(self): slist = ', '.join(repr(self[n]) for n in range(len(self))) return "textarray([%s], \n dtype='|S%d:%s')"\ % (slist, self.itemsize, self.encoding) def __getitem__(self, index): bstring = ndarray.__getitem__(self, index) return self._decode(bstring) def __setitem__(self, index, string): bstring = string.encode(self.encoding) if len(bstring) > self.itemsize: raise ValueError("Encoded bytes don't fit") ndarray.__setitem__(self, index, bstring) def _decode(self, b): b = b + b'\0' * (4 - len(b) % 4) s = b.decode(self.encoding) for n, c in enumerate(reversed(s)): if c != '\0': return s[:len(s)-n] return s if __name__ == "__main__": import doctest doctest.testmod() -------------- next part -------------- A non-text attachment was scrubbed... Name: textarray.py Type: text/x-python Size: 2215 bytes Desc: not available URL: From chris.barker at noaa.gov Thu Jan 23 16:51:14 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 23 Jan 2014 13:51:14 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 12:10 PM, wrote: > > Exactly -- but what should those conversion/casting rules be? We can't > > decide that unless we decide if 'S' is for text or for arbitrary bytes > -- it > > can't be both. I say text, that's what it's mostly trying to do already. > But > > if it's bytes, fine, then some things still need cleaning up, and we > could > > really use a one-byte-text type. and if it's text, then we may need a > bytes > > dtype. > > (remember I'm just a balcony muppet) > me too ;-) > As far as I understand all codecs have the same ascii part. nope -- certainly not multi-byte codecs. And one of the key points of utf-8 is that the ascii part is compatible -- none of teh other full-unicode encoding are. many of the one-byte-per-char ones do share the ascii part, but not all, or not completely. So I would > cast on ascii and raise on anything else. > still a fine option -- clearly defined and quite useful for scientific text. However, I would prefer latin-1 -- that way you might get garbage for the non-ascii parts, but it wouldn't raise an exception and it round-trips through encoding/decoding. And you would have a somewhat more useful subset -- including the latin-language character and symbols like the degree symbol, etc. 
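To make the trade-off concrete, here is a small plain-Python sketch (nothing numpy-specific is assumed, and the example bytes are made up): strict ascii raises on the first non-ascii byte, while latin-1 always succeeds and round-trips byte-for-byte, at the cost of possibly showing the wrong characters for data that was really in some other encoding.

    raw = b'temperature: 25\xb0C'   # 0xb0 is the degree sign in latin-1

    try:
        raw.decode('ascii')          # strict: refuses the non-ascii byte
    except UnicodeDecodeError as err:
        print('ascii failed:', err)

    text = raw.decode('latin-1')     # never raises: every byte maps to a code point
    print(text)                      # temperature: 25 followed by the degree sign and C
    print(text.encode('latin-1') == raw)   # True -- the round trip is lossless

If the bytes were really utf-8, the latin-1 text would look like mojibake, but encoding it back to latin-1 would still reproduce the original bytes exactly.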
> or follow whatever the convention of numpy is: > > >>> s = -256 > >>> np.array((s,), dtype=np.uint8)[0] == s > False > >>> s = -1 > >>> np.array((s,), dtype=np.uint8)[0] == s > False > I think text is distinct enough from numbers that we don't need to do that same thing -- and this is result of well-defined casting rules built into the compiler (and hardware?) for the numeric types. I dont hink we have either the standard or compiler support for text conversions like that. -CHB PS: this is interesting, on py2: In [176]: a = np.array((2222,), dtype='S') In [177]: a Out[177]: array(['2'], dtype='|S1') It converts it to a string, but only grabs the first character? (is it determining the size before converting to a string? and this: In [182]: a = np.array(2222, dtype='S') In [183]: a Out[183]: array('2222', dtype='|S24') 24 ? where did that come from? > > Josef > > > > > Key here is that we don't have the option of not breaking anything, > because > > there is a lot already broken. > > > > -Chris > > > > > > -- > > > > Christopher Barker, Ph.D. > > Oceanographer > > > > Emergency Response Division > > NOAA/NOS/OR&R (206) 526-6959 voice > > 7600 Sand Point Way NE (206) 526-6329 fax > > Seattle, WA 98115 (206) 526-6317 main reception > > > > Chris.Barker at noaa.gov > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at scipy.org > > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenny.stone125 at gmail.com Thu Jan 23 17:23:50 2014 From: jenny.stone125 at gmail.com (jennifer stone) Date: Fri, 24 Jan 2014 03:53:50 +0530 Subject: [Numpy-discussion] (no subject) Message-ID: Both scipy and numpy require GSOC > candidates to have a pull request accepted as part of the application > process. I'd suggest implementing a function not currently in scipy that > you think would be useful. That would also help in finding a mentor for the > summer. I'd also suggest getting familiar with cython. > > Chuck > Thanks a lot for the heads-up. I am yet to be familiarized with Cython and it indeed is playing a crucial role especially in the 'special' module > > I don't see you on github yet, are you there? If not, you should set up an > account to work in. See the developer guide > for some pointers. > > Chuck > I am present on github but the profile at present is just a mark of humble mistakes of a beginner to open-sourcing, The id is https://github.com/jennystone. I hope to build upon my profile. Jennifer -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenny.stone125 at gmail.com Thu Jan 23 17:58:37 2014 From: jenny.stone125 at gmail.com (jennifer stone) Date: Fri, 24 Jan 2014 04:28:37 +0530 Subject: [Numpy-discussion] (no subject) Message-ID: Scipy doesn't have a function for the Laplace transform, it has only a > Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage. 
> An inverse Laplace transform would be very welcome I'd think - it has real > world applications, and there's no good implementation in any open source > library as far as I can tell. It's probably doable, but not the easiest > topic for a GSoC I think. From what I can find, the paper "Numerical > Transform Inversion Using Gaussian Quadrature" from den Iseger contains > what's considered the current state of the art algorithm. Browsing that > gives a reasonable idea of the difficulty of implementing `ilaplace`. A brief scanning through the paper "Numerical Transform Inversion Using Gaussian Quadrature" from den Iseger does indicate the complexity of the algorithm. But GSoC project or not, can't we work on it, step by step? As I would love to see a contender for Matlab's ilaplace on open source front!! > > You can have a look at https://github.com/scipy/scipy/pull/2908/files for > ideas. Most of the things that need improving or we really think we should > have in Scipy are listed there. Possible topics are not restricted to that > list though - it's more important that you pick something you're interested > in and have the required background and coding skills for. > Thanks a lot for the roadmap. Of the options provided, I found the 'Cython'ization of Cluster great. Would it be possible to do it as the Summer project if I spend the month learning Cython? Regards Janani > Cheers, > Ralf > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vincent at vincentdavis.net Thu Jan 23 18:26:21 2014 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 23 Jan 2014 17:26:21 -0600 Subject: [Numpy-discussion] De Bruijn sequence Message-ID: I happen to be working with De Bruijn sequences. Is there any interest in this being part of numpy/scipy? https://gist.github.com/vincentdavis/8588879 Vincent Davis -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Thu Jan 23 18:56:36 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Thu, 23 Jan 2014 18:56:36 -0500 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 4:51 PM, Chris Barker wrote: > On Thu, Jan 23, 2014 at 12:10 PM, wrote: >> >> > Exactly -- but what should those conversion/casting rules be? We can't >> > decide that unless we decide if 'S' is for text or for arbitrary bytes >> > -- it >> > can't be both. I say text, that's what it's mostly trying to do already. >> > But >> > if it's bytes, fine, then some things still need cleaning up, and we >> > could >> > really use a one-byte-text type. and if it's text, then we may need a >> > bytes >> > dtype. >> >> (remember I'm just a balcony muppet) > > > me too ;-) > > >> >> As far as I understand all codecs have the same ascii part. > > > nope -- certainly not multi-byte codecs. And one of the key points of utf-8 > is that the ascii part is compatible -- none of teh other full-unicode > encoding are. > > many of the one-byte-per-char ones do share the ascii part, but not all, or > not completely. > >> So I would >> cast on ascii and raise on anything else. > > > still a fine option -- clearly defined and quite useful for scientific text. 
> However, I would prefer latin-1 -- that way you might get garbage for the > non-ascii parts, but it wouldn't raise an exception and it round-trips > through encoding/decoding. And you would have a somewhat more useful subset > -- including the latin-language character and symbols like the degree > symbol, etc. I'm not sure anymore, after all these threads I think bytes should be bytes and strings should be strings >>> x = np.array(['hugo'], 'S') Traceback (most recent call last): File "", line 1, in x = np.array(['hugo'], float) ValueError: could not convert string to bytes: 'hugo' >>> x = np.array([b'hugo'], 'S') >>> but with support for textarrays as Oscars showed, to make it easy to convert between the 'S' and 'S:encoding' or use either view on the memory. I like the idea of an `encoding_view` on some 'S' bytes, and once we have a view like that there is no reason to pretend 'S' bytes are text. > >> >> or follow whatever the convention of numpy is: >> >> >>> s = -256 >> >>> np.array((s,), dtype=np.uint8)[0] == s >> False >> >>> s = -1 >> >>> np.array((s,), dtype=np.uint8)[0] == s >> False > > > I think text is distinct enough from numbers that we don't need to do that > same thing -- and this is result of well-defined casting rules built into > the compiler (and hardware?) for the numeric types. I dont hink we have > either the standard or compiler support for text conversions like that. > > -CHB > > PS: this is interesting, on py2: > > > In [176]: a = np.array((2222,), dtype='S') > > In [177]: a > Out[177]: > array(['2'], > dtype='|S1') > > It converts it to a string, but only grabs the first character? (is it > determining the size before converting to a string? I recently fixed a bug in statsmodels based on this. I don't know why the code worked before, I assume it used string integers instead of integers at some point when it was written > > and this: > > In [182]: a = np.array(2222, dtype='S') > > In [183]: a > Out[183]: > array('2222', > dtype='|S24') > > 24 ? where did that come from? No idea. Unless I missed something when I didn't pay attention, there never before was any discussion on the mailing list about bytes versus strings in python 3 in numpy (I don't follow numpy's "issues"). And I neither remember (m)any public complaints about the behavior of the 'S' type in strange cases. maybe I didn't pay attention because I didn't care, until we ran into the python 3 problems. maybe nobody else did either. Josef > > > > > > > > > > > >> >> >> Josef >> >> > >> > Key here is that we don't have the option of not breaking anything, >> > because >> > there is a lot already broken. >> > >> > -Chris >> > >> > >> > -- >> > >> > Christopher Barker, Ph.D. >> > Oceanographer >> > >> > Emergency Response Division >> > NOAA/NOS/OR&R (206) 526-6959 voice >> > 7600 Sand Point Way NE (206) 526-6329 fax >> > Seattle, WA 98115 (206) 526-6317 main reception >> > >> > Chris.Barker at noaa.gov >> > >> > _______________________________________________ >> > NumPy-Discussion mailing list >> > NumPy-Discussion at scipy.org >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > -- > > Christopher Barker, Ph.D. 
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From oscar.j.benjamin at gmail.com Thu Jan 23 19:02:26 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Fri, 24 Jan 2014 00:02:26 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On 23 January 2014 21:51, Chris Barker wrote: > > However, I would prefer latin-1 -- that way you might get garbage for the > non-ascii parts, but it wouldn't raise an exception and it round-trips > through encoding/decoding. And you would have a somewhat more useful subset > -- including the latin-language character and symbols like the degree > symbol, etc. Exceptions and error messages are a good thing! Garbage is not!!! :) Oscar From chris.barker at noaa.gov Thu Jan 23 20:09:28 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 23 Jan 2014 17:09:28 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin wrote: > On 23 January 2014 21:51, Chris Barker wrote: > > > > However, I would prefer latin-1 -- that way you might get garbage for > the > > non-ascii parts, but it wouldn't raise an exception and it round-trips > > through encoding/decoding. And you would have a somewhat more useful > subset > > -- including the latin-language character and symbols like the degree > > symbol, etc. > > Exceptions and error messages are a good thing! Garbage is not!!! :) > in principle, I agree with you, but sometimes practicality beats purity. in py2 there is a lot of implicit encoding/decoding going on, using the system encoding. That is ascii on a lot of systems. The result is that there is a lot of code out there that folks have ported to use unicode, but missed a few corners. If that code is only tested with ascii, it all seems to be working, but then out in the wild someone puts another character in there and presto -- a crash. Also, there are places where the inability to encode results in silent message loss -- for instance, if an Exception is raised with a unicode message, it will get silently dropped when it comes time to display it on the terminal. I spent quite a while banging my head against that one recently when I tried to update some code to read unicode files. I would have been MUCH happier with a bit of garbage in the message than having it dropped (or raise an encoding error in the middle of the error...) I think this is a bad thing. The advantage of latin-1 is that while you might get something that doesn't print right, it won't crash, and it won't contaminate the data, so comparisons, etc., will still work. Kind of like using utf-8 in an old-style C char array -- you can still pass it around and compare it, even if the bytes don't mean what you think they do. -CHB -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Jan 23 20:12:35 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 23 Jan 2014 17:12:35 -0800 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 3:56 PM, wrote: > > I'm not sure anymore, after all these threads I think bytes should be > bytes and strings should be strings > exactly -- that's the py3 model, and I think we really soudl try to conform to it, it's really the only way to have a robust solution. > I like the idea of an `encoding_view` on some 'S' bytes, and once we > have a view like that there is no reason to pretend 'S' bytes are > text. right, then they are bytes, not text. period. I'm not sure if we should conflate encoded text and arbitrary bytes, but it does make sense to build encoded text on a bytes object. maybe I didn't pay attention because I didn't care, until we ran into > the python 3 problems. maybe nobody else did either. > yup -- I think this didn't get a whole lot of review or testing.... -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Thu Jan 23 20:41:24 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Fri, 24 Jan 2014 01:41:24 +0000 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On 24 January 2014 01:09, Chris Barker wrote: > On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin > wrote: >> >> On 23 January 2014 21:51, Chris Barker wrote: >> > >> > However, I would prefer latin-1 -- that way you might get garbage for >> > the >> > non-ascii parts, but it wouldn't raise an exception and it round-trips >> > through encoding/decoding. And you would have a somewhat more useful >> > subset >> > -- including the latin-language character and symbols like the degree >> > symbol, etc. >> >> Exceptions and error messages are a good thing! Garbage is not!!! :) > > in principle, I agree with you, but sometime practicality beets purity. > > in py2 there is a lot of implicit encoding/decoding going on, using the > system encoding. That is ascii on a lot of systems. The result is that there > is a lot of code out there that folks have ported to use unicode, but missed > a few corners. If that code is only testes with ascii, it all seems o be > working but then out in the wild someone puts another character in there and > presto -- a crash. Precisely. The Py3 text model uses TypeErrors to warn early against this kind of thing. No longer do you have code that seems to work until the wrong character goes in. 
You get the error straight away when you try to mix bytes and text. You still have the option to silence those errors: it just needs to be done explicitly: >>> s = '?scar' >>> s.encode('ascii', errors='replace') b'?scar' > Also, there are places where the inability to encode makes silent message -- > for instance if an Exception is raised with a unicode message, it will get > silently dropped when it comes time to display on the terminal. I spent > quite a wile banging my head against that one recently when I tried to > update some code to read unicode files. I would have been MUCH happier with > a bit of garbage in the mesae than having it drop (or raise an encoding > error in the middle of the error...) Yeah, that's just a bug in CPython. I think it's fixed now but either way you're right: for the particular case of displaying error messages the interpreter should do whatever it takes to get some kind of error message out even if it's a bit garbled. I disagree that this should be the basis for ordinary data processing with numpy though. > I think this is a bad thing. > > The advantage of latin-1 is that while you might get something that doesn't > print right, it won't crash, and it won't contaminate the data, so > comparisons, etc, will still work. kind of like using utf-8 in an old-style > c char array -- you can still passi t around and copare it, even if the > bytes dont mean what you think they do. It round trips okay as long as you don't try to do anything else with the string. So does the textarray class I proposed in a new thread: If you just use fromfile and tofile it works fine for any input (except for trailing nulls) but if you try to decode invalid bytes it will throw errors. It wouldn't be hard to add configurable error-handling there either. Oscar From dineshbvadhia at hotmail.com Fri Jan 24 09:13:02 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Fri, 24 Jan 2014 06:13:02 -0800 Subject: [Numpy-discussion] vstack and hstack performance penalty Message-ID: When using vstack or hstack for large arrays, are there any performance penalties eg. takes longer time-wise or makes a copy of an array during operation ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Fri Jan 24 09:58:26 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Fri, 24 Jan 2014 15:58:26 +0100 Subject: [Numpy-discussion] vstack and hstack performance penalty In-Reply-To: References: Message-ID: <1390575506.7837.7.camel@sebastian-laptop> On Fri, 2014-01-24 at 06:13 -0800, Dinesh Vadhia wrote: > When using vstack or hstack for large arrays, are there any > performance penalties eg. takes longer time-wise or makes a copy of an > array during operation ? No, they all use concatenate. There are only constant overheads on top of the necessary data copying. Though performance may vary because of memory order, etc. - Sebastian > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From dineshbvadhia at hotmail.com Fri Jan 24 10:19:09 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Fri, 24 Jan 2014 07:19:09 -0800 Subject: [Numpy-discussion] Catching out-of-memory error before it happens Message-ID: I want to write a general exception handler to warn if too much data is being loaded for the ram size in a machine for a successful numpy array operation to take place. 
For example, the program multiplies two floating point arrays A and B which are populated with loadtext. While the data is being loaded, want to continuously check that the data volume doesn't pass a threshold that will cause on out-of-memory error during the A*B operation. The known variables are the amount of memory available, data type (floats in this case) and the numpy array operation to be performed. It seems this requires knowledge of the internal memory requirements of each numpy operation. For sake of simplicity, can ignore other memory needs of program. Is this possible? -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jan 24 10:30:34 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jan 2014 15:30:34 +0000 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: Message-ID: There is no reliable way to predict how much memory an arbitrary numpy operation will need, no. However, in most cases the main memory cost will be simply the need to store the input and output arrays; for large arrays, all other allocations should be negligible. The most effective way to avoid running out of memory, therefore, is to avoid creating temporary arrays, by using only in-place operations. E.g., if a and b each require N bytes of ram, then memory requirements (roughly). c = a + b: 3N c = a + 2*b: 4N a += b: 2N np.add(a, b, out=a): 2N b *= 2; a += b: 2N Note that simply loading a and b requires 2N memory, so the latter code samples are near-optimal. Of course some calculations do require the use of temporary storage space... -n On 24 Jan 2014 15:19, "Dinesh Vadhia" wrote: > I want to write a general exception handler to warn if too much data is > being loaded for the ram size in a machine for a successful numpy array > operation to take place. For example, the program multiplies two floating > point arrays A and B which are populated with loadtext. While the data is > being loaded, want to continuously check that the data volume doesn't pass > a threshold that will cause on out-of-memory error during the A*B > operation. The known variables are the amount of memory available, data > type (floats in this case) and the numpy array operation to be performed. > It seems this requires knowledge of the internal memory requirements of > each numpy operation. For sake of simplicity, can ignore other memory > needs of program. Is this possible? > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From francesc at continuum.io Fri Jan 24 10:33:45 2014 From: francesc at continuum.io (Francesc Alted) Date: Fri, 24 Jan 2014 16:33:45 +0100 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: Message-ID: <52E287D9.4030507@continuum.io> Yeah, numexpr is pretty cool for avoiding temporaries in an easy way: https://github.com/pydata/numexpr Francesc El 24/01/14 16:30, Nathaniel Smith ha escrit: > > There is no reliable way to predict how much memory an arbitrary numpy > operation will need, no. However, in most cases the main memory cost > will be simply the need to store the input and output arrays; for > large arrays, all other allocations should be negligible. 
> > The most effective way to avoid running out of memory, therefore, is > to avoid creating temporary arrays, by using only in-place operations. > > E.g., if a and b each require N bytes of ram, then memory requirements > (roughly). > > c = a + b: 3N > c = a + 2*b: 4N > a += b: 2N > np.add(a, b, out=a): 2N > b *= 2; a += b: 2N > > Note that simply loading a and b requires 2N memory, so the latter > code samples are near-optimal. > > Of course some calculations do require the use of temporary storage > space... > > -n > > On 24 Jan 2014 15:19, "Dinesh Vadhia" > wrote: > > I want to write a general exception handler to warn if too much > data is being loaded for the ram size in a machine for a > successful numpy array operation to take place. For example, the > program multiplies two floating point arrays A and B which are > populated with loadtext. While the data is being loaded, want to > continuously check that the data volume doesn't pass a threshold > that will cause on out-of-memory error during the A*B operation. > The known variables are the amount of memory available, data type > (floats in this case) and the numpy array operation to be > performed. It seems this requires knowledge of the internal memory > requirements of each numpy operation. For sake of simplicity, can > ignore other memory needs of program. Is this possible? > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 24 10:57:14 2014 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Fri, 24 Jan 2014 07:57:14 -0800 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: Message-ID: <-1989644387440925520@unknownmsgid> c = a + b: 3N c = a + 2*b: 4N Does python garbage collect mid-expression? I.e. : C = (a + 2*b) + b 4 or 5 N? Also note that when memory gets tight, fragmentation can be a problem. I.e. if two size-n arrays where just freed, you still may not be able to allocate a size-2n array. This seems to be worse on windows, not sure why. a += b: 2N np.add(a, b, out=a): 2N b *= 2; a += b: 2N Note that simply loading a and b requires 2N memory, so the latter code samples are near-optimal. And will run quite a bit faster for large arrays--pushing that memory around takes time. -Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From dineshbvadhia at hotmail.com Fri Jan 24 11:01:48 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Fri, 24 Jan 2014 08:01:48 -0800 Subject: [Numpy-discussion] vstack and hstack performance penalty In-Reply-To: <1390575506.7837.7.camel@sebastian-laptop> References: <1390575506.7837.7.camel@sebastian-laptop> Message-ID: If A is very large and B is very small then np.concatenate(A, B) will copy B's data over to A which would take less time than the other way around - is that so? Does 'memory order' mean that it depends on sufficient contiguous memory being available for B otherwise it will be fragmented or something else? 
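A quick way to check the copying question empirically -- a minimal sketch (array sizes are arbitrary), using np.concatenate directly since that is what vstack/hstack call underneath:

    import numpy as np

    A = np.zeros(1000000)
    B = np.ones(3)

    C = np.concatenate([A, B])   # a new array is allocated; A and B are both copied into it

    print(C.shape)           # (1000003,)
    print(C.base is None)    # True: C owns its own memory, it is not a view of A or B
    print(A.shape, B.shape)  # (1000000,) (3,) -- neither input is modified or resized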
From robert.kern at gmail.com Fri Jan 24 11:21:07 2014 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 24 Jan 2014 16:21:07 +0000 Subject: [Numpy-discussion] vstack and hstack performance penalty In-Reply-To: References: <1390575506.7837.7.camel@sebastian-laptop> Message-ID: On Fri, Jan 24, 2014 at 4:01 PM, Dinesh Vadhia wrote: > > If A is very large and B is very small then np.concatenate(A, B) will copy > B's data over to A which would take less time than the other way around - is > that so? No, neither array is modified in-place. A new array is created and both A and B are copied into it. The order is largely unimportant. > Does 'memory order' mean that it depends on sufficient contiguous > memory being available for B otherwise it will be fragmented or something > else? No, the output is never fragmented. numpy arrays may be strided, but never fragmented arbitrarily to fit into a fragmented address space. http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray The issue is what axis the concatenation happens on. If it's the first axis (and both inputs are contiguous), then it only takes two memcpy() calls to copy the data, one for each input, because the regions where they go into the output are juxtaposed. If you concatenate on one of the other axes, though, then the memory regions for A and B will be interleaved and you have to do 2*N memory copies (N being some number depending on the shape). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jan 24 11:25:37 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jan 2014 16:25:37 +0000 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: <-1989644387440925520@unknownmsgid> References: <-1989644387440925520@unknownmsgid> Message-ID: On 24 Jan 2014 15:57, "Chris Barker - NOAA Federal" wrote: > > >> c = a + b: 3N >> c = a + 2*b: 4N > > Does python garbage collect mid-expression? I.e. : > > C = (a + 2*b) + b > > 4 or 5 N? It should be collected as soon as the reference gets dropped, so 4N. (This is the advantage of a greedy refcounting collector.) > Also note that when memory gets tight, fragmentation can be a problem. I.e. if two size-n arrays where just freed, you still may not be able to allocate a size-2n array. This seems to be worse on windows, not sure why. If your arrays are big enough that you're worried that making a stray copy will ENOMEM, then you *shouldn't* have to worry about fragmentation - malloc will give each array its own virtual mapping, which can be backed by discontinuous physical memory. (I guess it's possible windows has a somehow shoddy VM system and this isn't true, but that seems unlikely these days?) Memory fragmentation is more a problem if you're allocating lots of small objects of varying sizes. On 32 bit, virtual address fragmentation could also be a problem, but if you're working with giant data sets then you need 64 bits anyway :-). -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From emanuele at relativita.com Fri Jan 24 11:30:33 2014 From: emanuele at relativita.com (Emanuele Olivetti) Date: Fri, 24 Jan 2014 17:30:33 +0100 Subject: [Numpy-discussion] np.array creation: unexpected behaviour Message-ID: <52E29529.9080705@relativita.com> Hi, I just came across this unexpected behaviour when creating a np.array() from two other np.arrays of different shape. 
Have a look at this example: ---- import numpy as np a = np.zeros(3) b = np.zeros((2,3)) c = np.zeros((3,2)) ab = np.array([a, b]) print ab.shape, ab.dtype ac = np.array([a, c], dtype=np.object) print ac.shape, ac.dtype ac_no_dtype = np.array([a, c]) print ac_no_dtype.shape, ac_no_dtype.dtype ---- The output, with NumPy v1.6.1 (Ubuntu 12.04) is: ---- (2,) object (2, 3) object Traceback (most recent call last): File "/tmp/numpy_bug.py", line 9, in ac_no_dtype = np.array([a, c]) ValueError: setting an array element with a sequence. ---- The result for 'ab' is what I expect. The one for 'ac' is a bit surprising. The one for ac_no_dtype even is more surprising. Is this an expected behaviour? Best, Emanuele From josef.pktd at gmail.com Fri Jan 24 11:46:39 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 24 Jan 2014 11:46:39 -0500 Subject: [Numpy-discussion] np.array creation: unexpected behaviour In-Reply-To: <52E29529.9080705@relativita.com> References: <52E29529.9080705@relativita.com> Message-ID: On Fri, Jan 24, 2014 at 11:30 AM, Emanuele Olivetti wrote: > Hi, > > I just came across this unexpected behaviour when creating > a np.array() from two other np.arrays of different shape. > Have a look at this example: > ---- > import numpy as np > a = np.zeros(3) > b = np.zeros((2,3)) > c = np.zeros((3,2)) > ab = np.array([a, b]) > print ab.shape, ab.dtype > ac = np.array([a, c], dtype=np.object) > print ac.shape, ac.dtype > ac_no_dtype = np.array([a, c]) > print ac_no_dtype.shape, ac_no_dtype.dtype > ---- > The output, with NumPy v1.6.1 (Ubuntu 12.04) is: > ---- > (2,) object > (2, 3) object > Traceback (most recent call last): > File "/tmp/numpy_bug.py", line 9, in > ac_no_dtype = np.array([a, c]) > ValueError: setting an array element with a sequence. > ---- > > The result for 'ab' is what I expect. The one for 'ac' is > a bit surprising. The one for ac_no_dtype even > is more surprising. > > Is this an expected behaviour? the exception in ac_no_dtype is what I always expected, since it's not a rectangular array. It usually happened when I make a mistake. **Unfortunately** in newer numpy version it will also create an object array. AFAIR Josef > > Best, > > Emanuele > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From dineshbvadhia at hotmail.com Fri Jan 24 12:19:15 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Fri, 24 Jan 2014 09:19:15 -0800 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: Message-ID: So, with the example case, the approximate memory cost for an in-place operation would be: A *= B : 2N But, if the original A or B is to remain unchanged then it will be: C = A * B : 3N ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jan 24 12:23:02 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jan 2014 17:23:02 +0000 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: Message-ID: Yes. On 24 Jan 2014 17:19, "Dinesh Vadhia" wrote: > So, with the example case, the approximate memory cost for an in-place > operation would be: > > A *= B : 2N > > But, if the original A or B is to remain unchanged then it will be: > > C = A * B : 3N ? 
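In code, the two cases above look roughly like this (a sketch of the accounting just quoted; the array size is made up, and the real peak usage also depends on when Python frees temporaries):

    import numpy as np

    N = 10 * 1000 * 1000        # roughly 80 MB per float64 array
    A = np.random.rand(N)
    B = np.random.rand(N)

    # out-of-place: a third N-sized array is allocated for the result (~3N total)
    C = A * B
    del C

    # in-place: the product is written into A's existing buffer (~2N total)
    A *= B

    # the same in-place operation spelled with an explicit output argument
    np.multiply(A, B, out=A)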
> > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dineshbvadhia at hotmail.com Fri Jan 24 17:09:21 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Fri, 24 Jan 2014 14:09:21 -0800 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: <52E287D9.4030507@continuum.io> References: <52E287D9.4030507@continuum.io> Message-ID: Francesc: Thanks. I looked at numexpr a few years back but it didn't support array slicing/indexing. Has that changed? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Fri Jan 24 17:25:10 2014 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Fri, 24 Jan 2014 23:25:10 +0100 Subject: [Numpy-discussion] (no subject) In-Reply-To: References: Message-ID: On Thu, Jan 23, 2014 at 11:58 PM, jennifer stone wrote: > > > > > Scipy doesn't have a function for the Laplace transform, it has only a >> Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage. >> An inverse Laplace transform would be very welcome I'd think - it has real >> world applications, and there's no good implementation in any open source >> library as far as I can tell. It's probably doable, but not the easiest >> topic for a GSoC I think. From what I can find, the paper "Numerical >> Transform Inversion Using Gaussian Quadrature" from den Iseger contains >> what's considered the current state of the art algorithm. Browsing that >> gives a reasonable idea of the difficulty of implementing `ilaplace`. > > > A brief scanning through the paper "Numerical Transform Inversion Using > Gaussian Quadrature" from den Iseger does indicate the complexity of the > algorithm. But GSoC project or not, can't we work on it, step by step? As I > would love to see a contender for Matlab's ilaplace on open source front!! > Yes, it would be quite nice to have. So if you're interested, by all means give it a go. An issue for a GSoC will be how to maximize the chance of success - typically merging smaller PRs frequently helps a lot in that respect, but we can't merge an ilaplace implementation step by step. > You can have a look at https://github.com/scipy/scipy/pull/2908/files for >> ideas. Most of the things that need improving or we really think we should >> have in Scipy are listed there. Possible topics are not restricted to that >> list though - it's more important that you pick something you're >> interested >> in and have the required background and coding skills for. >> > > Thanks a lot for the roadmap. Of the options provided, I found the > 'Cython'ization of Cluster great. Would it be possible to do it as the > Summer project if I spend the month learning Cython? > There are a couple of things to consider. Your proposal should be neither too easy nor too ambitious for one summer. Cythonizing cluster is probably not enough for a full summer of work, especially if you can re-use some Cython code that David WF or other people already have. So some new functionality can be added to your proposal. The other important point is that you need to find a mentor. Cluster is one of the smaller modules that doesn't see a lot of development and most of the core devs may not know so well. A good proposal may help find an interested mentor. 
I suggest you start early with a draft proposal, and iterate a few times based on feedback on this list. You may want to have a look at your email client settings by the way, your replies seem to start new threads. Cheers, Ralf > Regards > Janani > > > >> Cheers, >> Ralf >> >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 24 17:29:19 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 24 Jan 2014 14:29:19 -0800 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: <-1989644387440925520@unknownmsgid> Message-ID: On Fri, Jan 24, 2014 at 8:25 AM, Nathaniel Smith wrote: > If your arrays are big enough that you're worried that making a stray copy > will ENOMEM, then you *shouldn't* have to worry about fragmentation - > malloc will give each array its own virtual mapping, which can be backed by > discontinuous physical memory. (I guess it's possible windows has a somehow > shoddy VM system and this isn't true, but that seems unlikely these days?) > All I know is that when I push the limits with memory on a 32 bit Windows system, it often crashed out when I've never seen more than about 1GB of memory use by the application -- I would have thought that would be plenty of overhead. I also know that I've reached limits onWindows32 well before OS_X 32, but that may be because IIUC, Windows32 only allows 2GB per process, whereas OS-X32 allows 4GB per process. Memory fragmentation is more a problem if you're allocating lots of small > objects of varying sizes. > It could be that's what I've been doing.... On 32 bit, virtual address fragmentation could also be a problem, but if > you're working with giant data sets then you need 64 bits anyway :-). > well, "giant" is defined relative to the system capabilities... but yes, if you're pushing the limits of a 32 bit system , the easiest thing to do is go to 64bits and some more memory! -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jan 24 17:43:46 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 24 Jan 2014 14:43:46 -0800 Subject: [Numpy-discussion] Text array dtype for numpy In-Reply-To: References: Message-ID: Oscar, Cool stuff, thanks! I'm wondering though what the use-case really is. The P3 text model (actually the py2 one, too), is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoding in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems where you can't avoid it. And you might choose different encodings based on different needs. So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker. 
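For scale, a small sketch of the overhead in question, with a made-up word list (the exact dtype strings depend on byte order, but the itemsizes are the point): numpy's 'U' dtype stores four bytes per character, while 'S' stores one.

    import numpy as np

    words = ['numpy', 'text', 'dtype']

    u = np.array(words)                                  # unicode dtype: 4 bytes per character
    s = np.array([w.encode('ascii') for w in words])     # bytes dtype: 1 byte per character

    print(u.dtype, u.itemsize, u.nbytes)   # <U5 20 60
    print(s.dtype, s.itemsize, s.nbytes)   # |S5 5 15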
Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs. If we want a more efficient and compact unicode implementation then the py3 one is a good place to start -it's pretty slick! Though maybe harder to due in numpy as text in numpy probably wouldn't be immutable. To make a slightly more concrete proposal, I've implemented a pure > Python ndarray subclass that I believe can consistently handle > text/bytes in Python 3. this scares me right there -- is it text or bytes??? We really don't want something that is both. > The idea is that the array has an encoding. It stores strings as > bytes. The bytes are encoded/decoded on insertion/access. Methods > accessing the binary content of the array will see the encoded bytes. > Methods accessing the elements of the array will see unicode strings. > > I believe it would not be as hard to implement as the proposals for > variable length string arrays. except that with some encodings, the number of bytes required is a function of what the content of teh text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters which require both careful truncation (a pain), and surprising results for users "why can't I fit 10 characters is a length-10 text object? And I can if they are different characters?) > The one caveat is that it will strip > null characters from the end of any string. which is fatal, but you do want a new dtype after all, which presumably wouldn't do that. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jan 24 18:09:01 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 24 Jan 2014 23:09:01 +0000 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: <-1989644387440925520@unknownmsgid> Message-ID: On Fri, Jan 24, 2014 at 10:29 PM, Chris Barker wrote: > On Fri, Jan 24, 2014 at 8:25 AM, Nathaniel Smith wrote: >> >> If your arrays are big enough that you're worried that making a stray copy >> will ENOMEM, then you *shouldn't* have to worry about fragmentation - malloc >> will give each array its own virtual mapping, which can be backed by >> discontinuous physical memory. (I guess it's possible windows has a somehow >> shoddy VM system and this isn't true, but that seems unlikely these days?) > > All I know is that when I push the limits with memory on a 32 bit Windows > system, it often crashed out when I've never seen more than about 1GB of > memory use by the application -- I would have thought that would be plenty > of overhead. > > I also know that I've reached limits onWindows32 well before OS_X 32, but > that may be because IIUC, Windows32 only allows 2GB per process, whereas > OS-X32 allows 4GB per process. > >> Memory fragmentation is more a problem if you're allocating lots of small >> objects of varying sizes. > > It could be that's what I've been doing.... > >> On 32 bit, virtual address fragmentation could also be a problem, but if >> you're working with giant data sets then you need 64 bits anyway :-). > > well, "giant" is defined relative to the system capabilities... 
but yes, if > you're pushing the limits of a 32 bit system , the easiest thing to do is > go to 64bits and some more memory! Oh, yeah, common confusion. Allowing 2 GiB of address space per process doesn't mean you can actually practically use 2 GiB of *memory* per process, esp. if you're allocating/deallocating a mix of large and small objects, because address space fragmentation will kill you way before that. The memory is there, there isn't anywhere to slot it into the process's address space. So you don't need to add more memory, just switch to a 64-bit OS. On 64-bit you have oodles of address space, so the memory manager can easily slot in large objects far away from small objects, and it's only fragmentation within each small-object arena that hurts. A good malloc will keep this overhead down pretty low though -- certainly less than the factor of two you're thinking about. -n From sebastian at sipsolutions.net Fri Jan 24 19:05:15 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Sat, 25 Jan 2014 01:05:15 +0100 Subject: [Numpy-discussion] Comparison changes Message-ID: <1390608315.7837.22.camel@sebastian-laptop> Hi all, in https://github.com/numpy/numpy/pull/3514 I proposed some changes to the comparison operators. This includes: 1. Comparison with None will broadcast in the future, so that `arr == None` will actually compare all elements to None. (A FutureWarning for now) 2. I added that == and != will give FutureWarning when an error was raised. In the future they should not silence these errors anymore. (For example shape mismatches) 3. We used to use PyObject_RichCompareBool for equality which includes an identity check. I propose to not do that identity check since we have elementwise equality (returning an object array for objects would be nice in some ways, but I think that is only an option for a dedicated function). The reason is that for example >>> a = np.array([np.array([1, 2, 3]), 1]) >>> b = np.array([np.array([1, 2, 3]), 1]) >>> a == b will happen to work if it happens to be that `a[0] is b[0]`. This currently has no deprecation, since the logic is in the inner loop and I am not sure if it is easy to add well there. Are there objections/comments to these changes? Regards, Sebastian From njs at pobox.com Fri Jan 24 19:18:12 2014 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 25 Jan 2014 00:18:12 +0000 Subject: [Numpy-discussion] Comparison changes In-Reply-To: <1390608315.7837.22.camel@sebastian-laptop> References: <1390608315.7837.22.camel@sebastian-laptop> Message-ID: On 25 Jan 2014 00:05, "Sebastian Berg" wrote: > > Hi all, > > in https://github.com/numpy/numpy/pull/3514 I proposed some changes to > the comparison operators. This includes: > > 1. Comparison with None will broadcast in the future, so that `arr == > None` will actually compare all elements to None. (A FutureWarning for > now) > > 2. I added that == and != will give FutureWarning when an error was > raised. In the future they should not silence these errors anymore. (For > example shape mismatches) This can just be a DeprecationWarning, because the only change is to raise new more errors. > 3. We used to use PyObject_RichCompareBool for equality which includes > an identity check. I propose to not do that identity check since we have > elementwise equality (returning an object array for objects would be > nice in some ways, but I think that is only an option for a dedicated > function). 
The reason is that for example > > >>> a = np.array([np.array([1, 2, 3]), 1]) > >>> b = np.array([np.array([1, 2, 3]), 1]) > >>> a == b > > will happen to work if it happens to be that `a[0] is b[0]`. This > currently has no deprecation, since the logic is in the inner loop and I > am not sure if it is easy to add well there. Surely any environment where we can call PyObject_RichCompareBool is an environment where we can issue a warning...? -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan at sun.ac.za Fri Jan 24 19:34:12 2014 From: stefan at sun.ac.za (=?iso-8859-1?Q?St=E9fan?= van der Walt) Date: Sat, 25 Jan 2014 01:34:12 +0100 Subject: [Numpy-discussion] np.array creation: unexpected behaviour In-Reply-To: <52E29529.9080705@relativita.com> References: <52E29529.9080705@relativita.com> Message-ID: <20140125003412.GL23850@gmail.com> On Fri, 24 Jan 2014 17:30:33 +0100, Emanuele Olivetti wrote: > I just came across this unexpected behaviour when creating > a np.array() from two other np.arrays of different shape. The tuple parsing for the construction of new numpy arrays is pretty tricky/hairy, and doesn't always do exactly what you'd expect. The easiest workaround is probably to pre-allocate the array: In [24]: data = [a, c] In [25]: x = np.empty(len(data), dtype=object) In [26]: x[:] = data In [27]: x.shape Out[27]: (2,) Regards St?fan From josef.pktd at gmail.com Fri Jan 24 23:19:44 2014 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Fri, 24 Jan 2014 23:19:44 -0500 Subject: [Numpy-discussion] Text array dtype for numpy In-Reply-To: References: Message-ID: On Fri, Jan 24, 2014 at 5:43 PM, Chris Barker wrote: > Oscar, > > Cool stuff, thanks! > > I'm wondering though what the use-case really is. The P3 text model > (actually the py2 one, too), is quite clear that you want users to think of, > and work with, text as text -- and not care how things are encoding in the > underlying implementation. You only want the user to think about encodings > on I/O -- transferring stuff between systems where you can't avoid it. And > you might choose different encodings based on different needs. > > So why have a different, the-user-needs-to-think-about-encodings numpy > dtype? We already have 'U' for full-on unicode support for text. There is a > good argument for a more compact internal representation for text compatible > with one-byte-per-char encoding, thus the suggestion for such a dtype. But I > don't see the need for quite this. Maybe I'm not being a creative enough > thinker. In my opinion something like Oscar's class would be very useful (with some adjustments, especially making it easy to create an S view or put a encoding view on top of an S array). (Disclaimer: My only experience is in converting some examples in statsmodels to bytes in py 3 and to play with some examples.) My guess is that 'S'/bytes is very convenient for library code, because it doesn't care about encodings (assuming we have enough control that all bytes are in the same encoding), and we don't have any overhead to convert to strings when comparing or working with "byte strings". 'S' is also very flexible because it doesn't tie us down to a minimum size for the encoding nor any specific encoding. The problem of 'S'/bytes is in input output and interactive work, as in the examples of Tom Aldcroft. The textarray dtype would allow us to view any 'S' array so we can have text/string interaction with python and get the correct encoding on input and output. 
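Roughly the effect I mean, spelled out with the existing vectorized np.char functions (only a sketch of the round-trip, with latin-1 chosen arbitrarily):

>>> import numpy as np
>>> raw = np.array([b'caf\xe9', b'abc'], dtype='S4')  # latin-1 encoded bytes in memory / on disk
>>> text = np.char.decode(raw, 'latin-1')             # unicode 'U' array for interactive work
>>> back = np.char.encode(text, 'latin-1')            # back to the compact 'S' bytes for output

A textarray dtype with an attached encoding would just make that decode/encode step implicit.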
Whether you live in an ascii, latin1, cp1252, iso8859_5 or in any other world, you could get your favorite minimal memory S/bytes/strings. I think this is useful as a complement to the current 'S' type, and to make that more useful on python 3, independent of what other small memory unicode dtype with predefined encoding numpy could get. > > Also, we may want numpy to interact at a low level with other libs that > might have binary encoded text (HDF, etc) -- in which case we need a bytes > dtype that can store that data, and perhaps encoding and decoding ufuncs. > > If we want a more efficient and compact unicode implementation then the py3 > one is a good place to start -it's pretty slick! Though maybe harder to due > in numpy as text in numpy probably wouldn't be immutable. > >> To make a slightly more concrete proposal, I've implemented a pure >> Python ndarray subclass that I believe can consistently handle >> text/bytes in Python 3. > > > this scares me right there -- is it text or bytes??? We really don't want > something that is both. Most users won't care about the internal representation of anything. But when we want or find it useful we can view the memory with any compatible dtype. That is, with numpy we always have also raw "bytes". And there are lot's of ways to shoot yourself why would you want to to that? : >>> a = np.arange(5) >>> b = a.view('S4') >>> b[1] = 'h' >>> a array([ 0, 104, 2, 3, 4]) >>> a[1] = 'h' Traceback (most recent call last): File "", line 1, in a[1] = 'h' ValueError: invalid literal for int() with base 10: 'h' > >> >> The idea is that the array has an encoding. It stores strings as >> bytes. The bytes are encoded/decoded on insertion/access. Methods >> accessing the binary content of the array will see the encoded bytes. >> Methods accessing the elements of the array will see unicode strings. >> >> I believe it would not be as hard to implement as the proposals for >> variable length string arrays. > > > except that with some encodings, the number of bytes required is a function > of what the content of teh text is -- so it either has to be variable > length, or a fixed number of bytes, which is not a fixed number of > characters which require both careful truncation (a pain), and surprising > results for users "why can't I fit 10 characters is a length-10 text > object? And I can if they are different characters?) not really different to other places where you have to pay attention to the underlying dtype, and a question of providing the underlying information. (like itemsize) 1 - 1e-20 I had code like that when I wasn't thinking properly or wasn't paying enough attention to what I was typing. > >> >> The one caveat is that it will strip >> null characters from the end of any string. > > > which is fatal, but you do want a new dtype after all, which presumably > wouldn't do that. The only place so far that I found where this really hurts is in the decode examples (with utf32LE for example). That's why I think numpy needs to have decode/encode functions, so it can access the bytes before they are null truncated, besides being vectorized. BTW: I wanted to start a new thread "in defence of (null truncated) 'S' string bytes", but I ran into too many other issues to work out the examples. Josef > > -Chris > > > -- > > Christopher Barker, Ph.D. 
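The same kind of silent surprise already exists today whenever you ignore the itemsize -- illustrative only:

>>> import numpy as np
>>> a = np.zeros(3, dtype='S4')
>>> a[0] = b'abcdefg'   # silently truncated to the 4-byte itemsize
>>> a[0]
b'abcd'
>>> 1 - 1e-20 == 1.0    # and the float analogue mentioned above
True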
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From sebastian at sipsolutions.net Sat Jan 25 05:25:50 2014 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Sat, 25 Jan 2014 11:25:50 +0100 Subject: [Numpy-discussion] Comparison changes In-Reply-To: References: <1390608315.7837.22.camel@sebastian-laptop> Message-ID: <1390645550.7837.27.camel@sebastian-laptop> On Sat, 2014-01-25 at 00:18 +0000, Nathaniel Smith wrote: > On 25 Jan 2014 00:05, "Sebastian Berg" > wrote: > > > > Hi all, > > > > in https://github.com/numpy/numpy/pull/3514 I proposed some changes > to > > the comparison operators. This includes: > > > > 1. Comparison with None will broadcast in the future, so that `arr > == > > None` will actually compare all elements to None. (A FutureWarning > for > > now) > > > > 2. I added that == and != will give FutureWarning when an error was > > raised. In the future they should not silence these errors anymore. > (For > > example shape mismatches) > > This can just be a DeprecationWarning, because the only change is to > raise new more errors. > Right, already is the case. > > 3. We used to use PyObject_RichCompareBool for equality which > includes > > an identity check. I propose to not do that identity check since we > have > > elementwise equality (returning an object array for objects would be > > nice in some ways, but I think that is only an option for a > dedicated > > function). The reason is that for example > > > > >>> a = np.array([np.array([1, 2, 3]), 1]) > > >>> b = np.array([np.array([1, 2, 3]), 1]) > > >>> a == b > > > > will happen to work if it happens to be that `a[0] is b[0]`. This > > currently has no deprecation, since the logic is in the inner loop > and I > > am not sure if it is easy to add well there. > > Surely any environment where we can call PyObject_RichCompareBool is > an environment where we can issue a warning...? > Right, I suppose an extra identity check and comparing it with the other result is indeed no problem. So I think I will add that. - Sebastian > -n > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From davidmenhur at gmail.com Sat Jan 25 10:48:05 2014 From: davidmenhur at gmail.com (=?UTF-8?B?RGHPgGlk?=) Date: Sat, 25 Jan 2014 16:48:05 +0100 Subject: [Numpy-discussion] Catching out-of-memory error before it happens In-Reply-To: References: <52E287D9.4030507@continuum.io> Message-ID: On 24 January 2014 23:09, Dinesh Vadhia wrote: > Francesc: Thanks. I looked at numexpr a few years back but it didn't > support array slicing/indexing. Has that changed? > No, but you can do it yourself. big_array = np.empty(20000) piece = big_array[30:-50] ne.evaluate('sqrt(piece)') Here, creating "piece" does not increase memory use, as slicing shares the original data (well, actually, it adds a mere 80 bytes, the overhead of an array). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Sat Jan 25 11:33:40 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 25 Jan 2014 09:33:40 -0700 Subject: [Numpy-discussion] using loadtxt to load a text file in to a numpy array In-Reply-To: References: <20140122104646.GA2555@gmail.com> <20140122211328.GA1938@gmail.com> <7210200738621770529@unknownmsgid> <20140123104520.GA2300@gmail.com> <20140123164305.GA5688@gmail.com> Message-ID: On Thu, Jan 23, 2014 at 11:49 AM, Chris Barker wrote: > Thanks for poking into this all. I've lost track a bit, but I think: > > The 'S' type is clearly broken on py3 (at least). I think that gives us > room to change it, and backward compatibly is less of an issue because it's > broken already -- do we need to preserve bug-for-bug compatibility? Maybe, > but I suspect in this case, not -- the code the "works fine" on py3 with > the 'S' type is probably only lucky that it hasn't encountered the issues > yet. > > And no matter how you slice it, code being ported to py3 needs to deal > with text handling issues. > > But here is where we stand: > > The 'S' dtype: > > - was designed for one-byte-per-char text data. > - was mapped to the py2 string type. > - used the classic C null-terminated approach. > - can be used for arbitrary bytes (as the py2 string type can), but not > quite, as it truncates null bytes -- so it really a bad idea to use it that > way. > > Under py3: > The 'S' type maps to the py3 bytes type, because that's the closest to > the py2 string type. But it also does some inconsistent things with > encoding, and does treat a lot of other things as text. But the py3 bytes > type does not have the same text handling as the py2 string type, so things > like: > > s = 'a string' > np.array((s,), dtype='S')[0] == s > > Gives you False, rather than True on py2. This is because a py3 string is > translated to the 'S' type (presumable with the default encoding, another > maybe not a good idea, but returns a bytes object, which does not compare > true to a py3 string. YOu can work aroudn this with varios calls to > encode() and decode, and/or using b'a string', but that is ugly, kludgy, > and doesn't work well with the py3 text model. > > > The py2 => py3 transition separated bytes and strings: strings are > unicode, and bytes are not to be used for text (directly). While there is > some text-related functionality still in bytes, the core devs are quite > clear that that is for special cases only, and not for general text > processing. > > I don't think numpy should fight this, but rather embrace the py3 text > model. The most natural way to do that is to use the existing 'U' dtype for > text. Really the best solution for most cases. (Like the above case) > > However, there is a use case for a more efficient way to deal with text. > There are a couple ways to go about that that have been brought up here: > > 1: have a more efficient unicode dtype: variable length, > multiple encoding options, etc.... > - This is a fine idea that would support better text handling in > numpy, and _maybe_ better interaction with external libraries (HDF, etc...) > > 2: Have a one-byte-per-char text dtype: > - This would be much easier to implement fit into the current numpy > model, and satisfy a lot of common use cases for scientific data sets. > > We could certainly do both, but I'd like to see (2) get done sooner than > later.... > This is pretty much my sense of things at the moment. 
I think 1) is needed in the long term but that 2) is a quick fix that solves most problems in the short term. > > A related issue is whether numpy needs a dtype analogous to py3 bytes -- > I'm still not sure of the use-case there, so can't comment -- would it need > to be fixed length (fitting into the numpy data model better) or variable > length, or ??? Some folks are (apparently) using the current 'S' type in > this way, but I think that's ripe for errors, due to the null bytes issue. > Though maybe there is a null-bytes-are-special binary format that isn't > text -- I have no idea. > > So what do we do with 'S'? It really is pretty broken, so we have a > couple choices: > > (1) depricate it, so that it stays around for backward compatibility > but encourage people to either use 'U' for text, or one of the new dtypes > that are yet to be implemented (maybe 's' for a one-byte-per-char dtype), > and use either uint8 or the new bytes dtype that is yet to be implemented. > > (2) fix it -- in this case, I think we need to be clear what it is: > -- A one-byte-char-text type? If so, it should map to a py3 string, > and have a defined encoding (ascii or latin-1, probably), or even better a > settable encoding (but only for one-byte-per-char encodings -- I don't > think utf-8 is a good idea here, as a utf-8 encoded string is of unknown > length. (there is some room for debate here, as the 'S' type is fixed > length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as > long as it doesn't partially truncate in teh middle of a charactor) > I think we should make it a one character encoded type compatible with str in python 2, and maybe latin-1 in python 3. I'm thinking latin-1 because of pep 393 where it is effectively a UCS-1, but ascii might be a bit more flexible because it is a subset of utf-8 and might serve better in python 2. > -- a bytes type? in which case, we should clean out all teh > automatic conversion to-from text that iare in it now. > > I'm not sure what to do about a bytes type. > I vote for it being our one-byte text type -- it almost is already, and it > would make the easiest transition for folks from py2 to py3. But backward > compatibility is backward compatibility. > > Not sure what to do here. It would be nice if S was a string type of given encoding. Might be worth an experiment to see how much breaks. > > numpy arrays need a decode and encode method > > > I'm not sure that they do. Rather there needs to be a text dtype that >> knows what encoding to use in order to have a binary interface as >> exposed by .tostring() and friends and but produce unicode strings >> when indexed from Python code. Having both a text and a binary >> interface to the same data implies having an encoding. > > > I agree with Oscar here -- let's not conflate encode and decoded data -- > the py3 text model is a fine one, we should work with it as much > as practical. > > UNLESS: if we do add a bytes dtype, then it would be a reasonable use case > to use it to store encoded text (just like the py3 bytes types), in which > case it would be good to have encode() and decode() methods or ufuncs -- > probably ufuncs. But that should be for special purpose, at the I/O > interface kind of stuff. > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sturla.molden at gmail.com Sat Jan 25 12:34:32 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Sat, 25 Jan 2014 17:34:32 +0000 (UTC) Subject: [Numpy-discussion] Numpy arrays vs typed memoryviews References: <68ad9af0-d113-4b20-ba44-11215a193a37@googlegroups.com> Message-ID: <1245884549412361147.264258sturla.molden-gmail.com@news.gmane.org> I think I have said this before, but its worth a repeat: Pickle (including cPickle) is a slow hog! That might not be the overhead you see, you just haven't noticed it yet. I saw this some years ago when I worked on shared memory arrays for Numpy (cf. my account on Github). Shared memory really did not help to speed up the IPC, because the entire overhead was dominated by pickle. (Shared memory is a fine way of saving RAM, though.) multiprocessing.Queue will use pickle for serialization, and is therefore not the right tool for numerical parallel computing with Cython or NumPy. In order to use multiprocessing efficiently with NumPy, we need a new Queue type that knows about NumPy arrays (and/or Cython memoryviews), and treat them as special cases. Getting rid of pickle altogether is the important part, not facilitating its use even further. It is easy to make a Queue type for Cython or NumPy arrays using a duplex pipe and couple of mutexes. Or you can use shared memory as ringbuffer and atomic compare-and-swap on the first bytes as spinlocks. It is not difficult to get the overhead of queuing arrays down to little more than a memcpy. I've been wanting to do this for a while, so maybe it is time to start a new toy project :) Sturla Neal Hughes wrote: > I like Cython a lot. My only complaint is that I have to keep switching > between the numpy array support and typed memory views. Both have there > advantages but neither can do every thing I need. > > Memoryviews have the clean syntax and seem to work better in cdef classes > and in inline functions. > > But Memoryviews can't be pickled and so can't be passed between > processes. Also there seems to be a high overhead on converting between > memory views and python numpy arrays. Where this overhead is a problem, > or where i need to use pythons multiprocessing module I tend to switch to numpy arrays. > > If memory views could be converted into python fast, and pickled I would > have no need for the old numpy array support. > > Wondering if these problems will ever be addressed, or if I am missing > something completely. > > -- > > --- > You received this message because you are subscribed to the Google Groups > "cython-users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to cython-users+unsubscribe at googlegroups.com. > For more options, visit href="https://groups.google.com/groups/opt_out.">https://groups.google.com/groups/opt_out. > > ------=_Part_1342_18667054.1390644115997 > Content-Type: text/html; charset=UTF-8 > Content-Transfer-Encoding: quoted-printable > >
> > ------=_Part_1342_18667054.1390644115997-- From stefan at sun.ac.za Sat Jan 25 17:06:56 2014 From: stefan at sun.ac.za (=?iso-8859-1?Q?St=E9fan?= van der Walt) Date: Sat, 25 Jan 2014 23:06:56 +0100 Subject: [Numpy-discussion] Comparison changes In-Reply-To: <1390608315.7837.22.camel@sebastian-laptop> References: <1390608315.7837.22.camel@sebastian-laptop> Message-ID: <20140125220656.GH26658@gmail.com> On Sat, 25 Jan 2014 01:05:15 +0100, Sebastian Berg wrote: > 1. Comparison with None will broadcast in the future, so that `arr == > None` will actually compare all elements to None. (A FutureWarning for > now) This is a very useful change in behavior--thanks! St?fan From oscar.j.benjamin at gmail.com Sat Jan 25 17:45:13 2014 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Sat, 25 Jan 2014 22:45:13 +0000 Subject: [Numpy-discussion] Text array dtype for numpy In-Reply-To: References: Message-ID: On 24 January 2014 22:43, Chris Barker wrote: > Oscar, > > Cool stuff, thanks! > > I'm wondering though what the use-case really is. The use-case is precisely the use-case for dtype='S' on Py2 except that it also works on Py3. > The P3 text model > (actually the py2 one, too), is quite clear that you want users to think of, > and work with, text as text -- and not care how things are encoding in the > underlying implementation. You only want the user to think about encodings > on I/O -- transferring stuff between systems where you can't avoid it. And > you might choose different encodings based on different needs. Exactly. But what you're missing is that storing text in a numpy array is putting the text into bytes and the encoding needs to be specified. My proposal involves explicitly specifying the encoding. This is the key point about the Python 3 text model: it is not that encoding isn't automatic (e.g. when you print() or call file.write with a text file); the point is that there must never be ambiguity about the encoding that is used when encode/decode occurs. > So why have a different, the-user-needs-to-think-about-encodings numpy > dtype? We already have 'U' for full-on unicode support for text. There is a > good argument for a more compact internal representation for text compatible > with one-byte-per-char encoding, thus the suggestion for such a dtype. But I > don't see the need for quite this. Maybe I'm not being a creative enough > thinker. Because users want to store text in a numpy array and use less than 4 bytes per character. You expressed a desire for this. The only difference between this and your latin-1 suggestion is that this one has an explicit encoding that is visible to the user and that you can choose that encoding to be anything that your Python installation supports. > Also, we may want numpy to interact at a low level with other libs that > might have binary encoded text (HDF, etc) -- in which case we need a bytes > dtype that can store that data, and perhaps encoding and decoding ufuncs. Perhaps there is a need for a bytes dtype as well. But not that you can use textarray with encoding='ascii' to satisfy many of these use cases. So h5py and pytables can expose an interface that stores text as bytes but has a clearly labelled (and enforced) encoding. > If we want a more efficient and compact unicode implementation then the py3 > one is a good place to start -it's pretty slick! Though maybe harder to due > in numpy as text in numpy probably wouldn't be immutable. It's not a good fit for numpy because numpy arrays expose their memory buffer. 
More on this below but if there was to be something as drastic as the FSR then it would be better to think about how to make an ndarray type that is completely different, has an opaque memory buffer and can handle arbitrary length text strings. >> To make a slightly more concrete proposal, I've implemented a pure >> Python ndarray subclass that I believe can consistently handle >> text/bytes in Python 3. > > this scares me right there -- is it text or bytes??? We really don't want > something that is both. I believe that there is a conceptual misunderstanding about what a numpy array is here. A numpy array is a clever view onto a memory buffer. A numpy array always has two interfaces, one that describes a memory buffer and one that delivers Python objects representing the abstract quantities described by each portion of the memory buffer. The dtype specifies three things: 1) How many bytes of the buffer are used. 2) What kind of abstract object this part of the buffer represents. 3) The mapping from the bytes in this segment of the buffer to the abstract object. As an example: >>> import numpy as np >>> a = np.array([1, 2, 3], dtype='>> a array([1, 2, 3], dtype=uint32) >>> a.tostring() b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' So what is this array? Is it bytes or is it integers? It is both. The array is a view onto a memory buffer and the dtype is the encoding that describes the meaning of the bytes in different segments. In this case the dtype is '>> a = np.array(['qwe'], dtype='U') >>> a array(['qwe'], dtype='>> a[0] # text 'qwe' >>> a.tostring() # bytes b'q\x00\x00\x00w\x00\x00\x00e\x00\x00\x00' In my proposal you'd get the same by using 'utf-32-le' as the encoding for your text array. >> The idea is that the array has an encoding. It stores strings as >> bytes. The bytes are encoded/decoded on insertion/access. Methods >> accessing the binary content of the array will see the encoded bytes. >> Methods accessing the elements of the array will see unicode strings. >> >> I believe it would not be as hard to implement as the proposals for >> variable length string arrays. > > except that with some encodings, the number of bytes required is a function > of what the content of teh text is -- so it either has to be variable > length, or a fixed number of bytes, which is not a fixed number of > characters which require both careful truncation (a pain), and surprising > results for users "why can't I fit 10 characters is a length-10 text > object? And I can if they are different characters?) It should be a fixed number of bytes. It does mean that 10 characters might not fit into a 10-byte text portion but there's no way around that if it is a fixed length and the encoding is variable-width. I don't really think that this is much of a problem though. Most use cases are probably going to use 'ascii' anyway. The improvement those use-cases get is error detection for non-ascii characters and explicitly labelled encodings, rather than mojibake. >> The one caveat is that it will strip >> null characters from the end of any string. > > which is fatal, but you do want a new dtype after all, which presumably > wouldn't do that. Why is that fatal for text (not arbitrary byte strings)? There are many other reasons (relating to other programming languages and software) why you can't usually put null characters into text anyway. I don't really see how to get around this if the bytes must go into fixed-width portions without an out-of-band way to specify the length of the string. 
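For reference, the plain 'S' dtype already behaves exactly this way: trailing nulls are part of the buffer but invisible at the Python level (quick sketch):

>>> import numpy as np
>>> a = np.array([b'ab'], dtype='S4')
>>> a[0]          # trailing nulls stripped on access
b'ab'
>>> a.tostring()  # but still present in the underlying buffer
b'ab\x00\x00'

So in this respect a fixed-width text array is no worse than the status quo.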
Oscar From francesc at continuum.io Sun Jan 26 02:39:00 2014 From: francesc at continuum.io (Francesc Alted) Date: Sun, 26 Jan 2014 08:39:00 +0100 Subject: [Numpy-discussion] ANN: numexpr 2.3 (final) released Message-ID: <52E4BB94.80300@continuum.io> ========================== Announcing Numexpr 2.3 ========================== Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. Numexpr is already being used in a series of packages (PyTables, pandas, BLZ...) for helping doing computations faster. What's new ========== The repository has been migrated to https://github.com/pydata/numexpr. All new tickets and PR should be directed there. Also, a `conj()` function for computing the conjugate of complex arrays has been added. Thanks to David Men?ndez. See PR #125. Finallly, we fixed a DeprecationWarning derived of using ``oa_ndim == 0`` and ``op_axes == NULL`` with `NpyIter_AdvancedNew()` and NumPy 1.8. Thanks to Mark Wiebe for advise on how to fix this properly. Many thanks to Christoph Gohlke and Ilan Schnell for his help during the testing of this release in all kinds of possible combinations of platforms and MKL. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? ========================= The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience ===================== Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted From francesc at continuum.io Sun Jan 26 02:44:25 2014 From: francesc at continuum.io (Francesc Alted) Date: Sun, 26 Jan 2014 08:44:25 +0100 Subject: [Numpy-discussion] ANN: python-blosc 1.2.0 released Message-ID: <52E4BCD9.90807@continuum.io> ============================= Announcing python-blosc 1.2.0 ============================= What is new? ============ This release adds support for the multiple compressors added in Blosc 1.3 series. The new compressors are: * lz4 (http://code.google.com/p/lz4/): A very fast compressor/decompressor. Could be thought as a replacement of the original BloscLZ, but it can behave better is some scenarios. * lz4hc (http://code.google.com/p/lz4/): This is a variation of LZ4 that achieves much better compression ratio at the cost of being much slower for compressing. Decompression speed is unaffected (and sometimes better than when using LZ4 itself!), so this is very good for read-only datasets. * snappy (http://code.google.com/p/snappy/): A very fast compressor/decompressor. Could be thought as a replacement of the original BloscLZ, but it can behave better is some scenarios. 
* zlib (http://www.zlib.net/): This is a classic. It achieves very good compression ratios, at the cost of speed. However, decompression speed is still pretty good, so it is a good candidate for read-only datasets. Selecting the compressor is just a matter of specifying the new `cname` parameter in compression functions. For example:: in = numpy.arange(N, dtype=numpy.int64) out = blosc.pack_array(in, cname="lz4") Just to have an overview of the differences between the different compressors in new Blosc, here it is the output of the included compress_ptr.py benchmark: https://github.com/ContinuumIO/python-blosc/blob/master/bench/compress_ptr.py that compresses/decompresses NumPy arrays with different data distributions:: Creating different NumPy arrays with 10**7 int64/float64 elements: *** np.copy() **** Time for memcpy(): 0.030 s *** the arange linear distribution *** *** blosclz *** Time for comp/decomp: 0.013/0.022 s. Compr ratio: 136.83 *** lz4 *** Time for comp/decomp: 0.009/0.031 s. Compr ratio: 137.19 *** lz4hc *** Time for comp/decomp: 0.103/0.021 s. Compr ratio: 165.12 *** snappy *** Time for comp/decomp: 0.012/0.045 s. Compr ratio: 20.38 *** zlib *** Time for comp/decomp: 0.243/0.056 s. Compr ratio: 407.60 *** the linspace linear distribution *** *** blosclz *** Time for comp/decomp: 0.031/0.036 s. Compr ratio: 14.27 *** lz4 *** Time for comp/decomp: 0.016/0.033 s. Compr ratio: 19.68 *** lz4hc *** Time for comp/decomp: 0.188/0.020 s. Compr ratio: 78.21 *** snappy *** Time for comp/decomp: 0.020/0.032 s. Compr ratio: 11.72 *** zlib *** Time for comp/decomp: 0.290/0.048 s. Compr ratio: 90.90 *** the random distribution *** *** blosclz *** Time for comp/decomp: 0.083/0.025 s. Compr ratio: 4.35 *** lz4 *** Time for comp/decomp: 0.022/0.034 s. Compr ratio: 4.65 *** lz4hc *** Time for comp/decomp: 1.803/0.039 s. Compr ratio: 5.61 *** snappy *** Time for comp/decomp: 0.028/0.023 s. Compr ratio: 4.48 *** zlib *** Time for comp/decomp: 3.146/0.073 s. Compr ratio: 6.17 That means that Blosc in combination with LZ4 can compress at speeds that can be up to 3x faster than a pure memcpy operation. Decompression is a bit slower (but still in the same order than memcpy()) probably because writing to memory is slower than reading. This was using an Intel Core i5-3380M CPU @ 2.90GHz, runnng Python 3.3 and Linux 3.7.10, but YMMV (and will vary!). For more info, you can have a look at the release notes in: https://github.com/ContinuumIO/python-blosc/wiki/Release-notes More docs and examples are available in the documentation site: http://blosc.pydata.org What is it? =========== python-blosc (http://blosc.pydata.org/) is a Python wrapper for the Blosc compression library. Blosc (http://blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Whether this is achieved or not depends of the data compressibility, the number of cores in the system, and other factors. See a series of benchmarks conducted for many different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. There is also a handy command line for Blosc called Bloscpack (https://github.com/esc/bloscpack) that allows you to compress large binary datafiles on-disk. 
Although the format for Bloscpack has not stabilized yet, it allows you to effectively use Blosc from your favorite shell. Installing ========== python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you should omit the python- prefix Download sources ================ The sources are managed through github services at: http://github.com/ContinuumIO/python-blosc Documentation ============= There is Sphinx-based documentation site at: http://blosc.pydata.org/ Mailing list ============ There is an official mailing list for Blosc at: blosc at googlegroups.com http://groups.google.es/group/blosc Licenses ======== Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/ContinuumIO/python-blosc/blob/master/LICENSES for more details. -- Francesc Alted Continuum Analytics, Inc. -- Francesc Alted From francesc at continuum.io Sun Jan 26 02:52:19 2014 From: francesc at continuum.io (Francesc Alted) Date: Sun, 26 Jan 2014 08:52:19 +0100 Subject: [Numpy-discussion] ANN: BLZ 0.6.1 has been released Message-ID: <52E4BEB3.5070003@continuum.io> Announcing BLZ 0.6 series ========================= What it is ---------- BLZ is a chunked container for numerical data. Chunking allows for efficient enlarging/shrinking of data container. In addition, it can also be compressed for reducing memory/disk needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data. The main objects in BLZ are `barray` and `btable`. `barray` is meant for storing multidimensional homogeneous datasets efficiently. `barray` objects provide the foundations for building `btable` objects, where each column is made of a single `barray`. Facilities are provided for iterating, filtering and querying `btables` in an efficient way. You can find more info about `barray` and `btable` in the tutorial: http://blz.pydata.org/blz-manual/tutorial.html BLZ can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too) either from memory or from disk. In the future, it is planned to use Numba as the computational kernel and to provide better Blaze (http://blaze.pydata.org) integration. What's new ---------- BLZ has been branched off from the Blaze project (http://blaze.pydata.org). BLZ was meant as a persistent format and library for I/O in Blaze. BLZ in Blaze is based on previous carray 0.5 and this is why this new version is labeled 0.6. BLZ supports completely transparent storage on-disk in addition to memory. That means that *everything* that can be done with the in-memory container can be done using the disk as well. The advantages of a disk-based container is that the addressable space is much larger than just your available memory. Also, as BLZ is based on a chunked and compressed data layout based on the super-fast Blosc compression library, the data access speed is very good. The format chosen for the persistence layer is based on the 'bloscpack' library and described in the "Persistent format for BLZ" chapter of the user manual ('docs/source/persistence-format.rst'). 
More about Bloscpack here: https://github.com/esc/bloscpack You may want to know more about BLZ in this blog entry: http://continuum.io/blog/blz-format In this version, support for Blosc 1.3 has been added, that meaning that a new `cname` parameter has been added to the `bparams` class, so that you can select you preferred compressor from 'blosclz', 'lz4', 'lz4hc', 'snappy' and 'zlib'. Also, many bugs have been fixed, providing a much smoother experience. CAVEAT: The BLZ/bloscpack format is still evolving, so don't trust on forward compatibility of the format, at least until 1.0, where the internal format will be declared frozen. Resources --------- Visit the main BLZ site repository at: http://github.com/ContinuumIO/blz Read the online docs at: http://blz.pydata.org/blz-manual/index.html Home of Blosc compressor: http://www.blosc.org User's mail list: blaze-dev at continuum.io ---- Enjoy! Francesc Alted Continuum Analytics, Inc. From chris.laumann at gmail.com Sun Jan 26 03:04:37 2014 From: chris.laumann at gmail.com (Chris Laumann) Date: Sun, 26 Jan 2014 00:04:37 -0800 Subject: [Numpy-discussion] Memory leak in numpy? Message-ID: Hi all- I think I just found a memory leak in numpy, or maybe I just don?t understand generators. Anyway, the following snippet will quickly eat a ton of RAM: P = randint(0,2, (20,13)) for i in range(50): ? ? for ai in ndindex((2,)*13): ? ? ? ? j = P.dot(ai) If you replace the last line with something like j = ai, the memory leak goes away. I?m not exactly sure what?s going on but the .dot seems to be causing the memory taken by the tuple ai to be held. This devours RAM in python 2.7.5 (OS X Mavericks default I believe), numpy version 1.8.0.dev-3084618. I?m upgrading to the latest Superpack (numpy 1.9) right now but I somehow doubt this behavior will change. Any thoughts? Best, Chris --? Chris Laumann Sent with Airmail -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoogendoorn.eelco at gmail.com Sun Jan 26 06:20:04 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Sun, 26 Jan 2014 12:20:04 +0100 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality Message-ID: Hi all, Please critique my draft exploring the possibilities of adding group_by support to numpy: http://pastebin.com/c5WLWPbp In nearly ever project I work on, I require group_by functionality of some sort. There are other libraries that provide this kind of functionality, such as pandas for instance, but I will try to make the case here that numpy ought to have a solid core of group_by functionality. Primarily, one may argue that the concept of grouping values by a key is far more general than a pandas dataframe. In particular, one often needs a simple one-line transient association between some keys and values, and trying to wrangle your problem into the more permanent and specialized datastructure that a dataframe is, is simply not called for. As a simple compact example: key1 = list('abaabb') key2 = np.random.randint(0,2,(6,2)) values = np.random.rand(6,3) print group_by((key1, key2)).median(values) Points of note; we can group by arbitrary combinations of keys, and subarrays can also act as keys. group_by has a rich set of reduction functionality, which performs efficient per-group reductions, as well as various ways to split your values per group. Also, the code here has a lot of overlap with np.unique and related arraysetops. 
functions like np.unique are easily reimplemented using the groundwork laid out here, and also may be extended to benefit from the generalizations made, allowing for a wider variety of objects to have their unique values taken; note the axis keyword here, meaning that what is unique here are the images found along the first axis; not the elements of shuffled. #create a stack of images images = np.random.rand(4,64,64) #shuffle the images; this is a giant mess now; how to find all the original ones? shuffled = images[np.random.randint(0,4,200)] #there you go print unique(shuffled, axis=0) Some more examples and unit tests can be found at the end of the module. Id love to hear your feedback on this. Specifically: - Do you agree numpy would benefit from group_by functionality? - Do you have suggestions for further generalizations/extensions? - Any commentary on design decisions / implementation? Regards, Eelco Hoogendoorn -------------- next part -------------- An HTML attachment was scrubbed... URL: From dineshbvadhia at hotmail.com Sun Jan 26 07:44:31 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Sun, 26 Jan 2014 04:44:31 -0800 Subject: [Numpy-discussion] MKL and OpenBLAS Message-ID: This conversation gets discussed often with Numpy developers but since the requirement for optimized Blas is pretty common these days, how about distributing Numpy with OpenBlas by default? People who don't want optimized BLAS or OpenBLAS can then edit the site.cfg file to add/remove. I can never remember if Numpy comes with Atlas by default but either way, if using MKL is not feasible because of its licensing issues then Numpy has to be re-compiled with OpenBLAS (for example). Why not make it easier for developers to use Numpy with an in-built optimized Blas. Btw, just in case some folks from Intel are listening: how about releasing MKL binaries for all platforms for developers to do with it what they want ie. free. You know it makes sense! -------------- next part -------------- An HTML attachment was scrubbed... URL: From dineshbvadhia at hotmail.com Sun Jan 26 07:54:14 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Sun, 26 Jan 2014 04:54:14 -0800 Subject: [Numpy-discussion] ANN: numexpr 2.3 (final) released In-Reply-To: <52E4BB94.80300@continuum.io> References: <52E4BB94.80300@continuum.io> Message-ID: Francesc Congratulations and will definitely be benchmarking Numexpr soon. Will similar performance improvements been seen with OpenBLAS as with MKL? From dineshbvadhia at hotmail.com Sun Jan 26 08:14:37 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Sun, 26 Jan 2014 05:14:37 -0800 Subject: [Numpy-discussion] ANN: BLZ 0.6.1 has been released In-Reply-To: <52E4BEB3.5070003@continuum.io> References: <52E4BEB3.5070003@continuum.io> Message-ID: For me, "binary data" wrt arrays means that data values are [0|1]. Is this what is meant in "The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data." ? From pav at iki.fi Sun Jan 26 09:40:44 2014 From: pav at iki.fi (Pauli Virtanen) Date: Sun, 26 Jan 2014 16:40:44 +0200 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: 26.01.2014 14:44, Dinesh Vadhia kirjoitti: > This conversation gets discussed often with Numpy developers but > since the requirement for optimized Blas is pretty common these > days, how about distributing Numpy with OpenBlas by default? 
People > who don't want optimized BLAS or OpenBLAS can then edit the > site.cfg file to add/remove. I can never remember if Numpy comes > with Atlas by default but either way, if using MKL is not feasible > because of its licensing issues then Numpy has to be re-compiled > with OpenBLAS (for example). Why not make it easier for developers > to use Numpy with an in-built optimized Blas. The Numpy Windows binaries distributed in the numpy project at sourceforge.net are compiled with ATLAS, which should count as an optimized BLAS. I don't recall what's the situation with OSX binaries, but I'd believe they're with Atlas too. If you are suggesting bundling OpenBLAS with Numpy source releases --- arguments against: OpenBLAS is big, and still rapidly moving. Moreover, bundling it with Numpy does not really make it any easier to build. -- Pauli Virtanen From valentin at haenel.co Sun Jan 26 10:44:41 2014 From: valentin at haenel.co (Valentin Haenel) Date: Sun, 26 Jan 2014 16:44:41 +0100 Subject: [Numpy-discussion] ANN: BLZ 0.6.1 has been released In-Reply-To: References: <52E4BEB3.5070003@continuum.io> Message-ID: <20140126154441.GA23374@kudu.in-berlin.de> Hi Dinesh Vadhia, * Dinesh Vadhia [2014-01-26]: > For me, "binary data" wrt arrays means that data values are [0|1]. Is this > what is meant in "The compression process is carried out internally by > Blosc, a high-performance compressor that is optimized for binary data." ? I believe, the term 'binary data' in this context refers to numerical data -- e.g. floats and ints -- in the sense that it is not ascii or other text. Blosc is especially well suited for this kind of data due to its optional shuffle filter. This filter will re-organize the bytes in the data that is to be compressed in order of significance. For this filter to work, each data value must be composed of multiple bytes, e.g. int64. For data values that are composed of a single byte, e.g. int8 or char, the filter does not work so well. Hope that helps, V- From stefan at sun.ac.za Sun Jan 26 12:02:40 2014 From: stefan at sun.ac.za (=?iso-8859-1?Q?St=E9fan?= van der Walt) Date: Sun, 26 Jan 2014 18:02:40 +0100 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: References: Message-ID: <20140126170240.GA4256@shinobi> Hi Eelco On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote: > key1 = list('abaabb') > key2 = np.random.randint(0,2,(6,2)) > values = np.random.rand(6,3) > print group_by((key1, key2)).median(values) I agree that group_by functionality could be handy in numpy. In the above example, what would the output of ``group_by((key1, key2))`` be? St?fan From stefan at sun.ac.za Sun Jan 26 12:06:40 2014 From: stefan at sun.ac.za (=?iso-8859-1?Q?St=E9fan?= van der Walt) Date: Sun, 26 Jan 2014 18:06:40 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: <20140126170640.GB4256@shinobi> On Sun, 26 Jan 2014 16:40:44 +0200, Pauli Virtanen wrote: > The Numpy Windows binaries distributed in the numpy project at > sourceforge.net are compiled with ATLAS, which should count as an > optimized BLAS. I don't recall what's the situation with OSX binaries, > but I'd believe they're with Atlas too. Was a switch made away from Accelerate after this? 
http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html St?fan From hoogendoorn.eelco at gmail.com Sun Jan 26 12:36:21 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Sun, 26 Jan 2014 18:36:21 +0100 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: <20140126170240.GA4256@shinobi> References: <20140126170240.GA4256@shinobi> Message-ID: An object of type GroupBy. So a call to group_by does not return any consumable output directly. If you want for instance the unique keys, or groups if you will, you can call GroupBy.unique. In this case, for a tuple of input keys, youd get a tuple of unique keys back. If you want to compute several reductions over the same set of keys, you can hang on to the GroupBy object, and the precomputations it encapsulates. To expand on that example: reduction operations also return the unique keys which the reduced elements belong to: (unique1, unique2), median = group_by((key1, key2)).median(values) print unique1 print unique2 print median yields something like ['a' 'a' 'b' 'b' 'a'] [[0 0] [0 1] [0 1] [1 0] [1 1]] [[ 0.34041782 0.78579254 0.91494441] [ 0.59422888 0.67915262 0.04327812] [ 0.45045529 0.45049761 0.49633574] [ 0.71623235 0.95760152 0.85137696] [ 0.96299801 0.27639574 0.70519413]] Note that the elements of unique1 and unique2 are not themselves unique, but rather their elements zipped together are unique. On Sun, Jan 26, 2014 at 6:02 PM, St?fan van der Walt wrote: > Hi Eelco > > On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote: > > key1 = list('abaabb') > > key2 = np.random.randint(0,2,(6,2)) > > values = np.random.rand(6,3) > > print group_by((key1, key2)).median(values) > > I agree that group_by functionality could be handy in numpy. > In the above example, what would the output of > > ``group_by((key1, key2))`` > > be? > > St?fan > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoogendoorn.eelco at gmail.com Sun Jan 26 12:50:04 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Sun, 26 Jan 2014 18:50:04 +0100 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: <20140126170240.GA4256@shinobi> References: <20140126170240.GA4256@shinobi> Message-ID: To follow up with an example as to why it is useful that a temporary object is created, consider the following (taken from the radial reduction example): g = group_by(np.round(radius, 5).flatten()) pp.errorbar( g.unique, g.mean(sample.flatten())[1], g.std(sample.flatten())[1] / np.sqrt(g.count)) Creating the GroupBy object encapsulates the expense of 'indexing' the keys, which is the most expensive part of these operations. We would have to redo that four times here, if we didn't have access to the GroupBy object. >From looking at the numpy source, I get the impression that it is considered good practice not to overuse OOP. And I agree, but I think it is called for here. On Sun, Jan 26, 2014 at 6:02 PM, St?fan van der Walt wrote: > Hi Eelco > > On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote: > > key1 = list('abaabb') > > key2 = np.random.randint(0,2,(6,2)) > > values = np.random.rand(6,3) > > print group_by((key1, key2)).median(values) > > I agree that group_by functionality could be handy in numpy. 
> In the above example, what would the output of > > ``group_by((key1, key2))`` > > be? > > St?fan > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alan.isaac at gmail.com Sun Jan 26 12:57:11 2014 From: alan.isaac at gmail.com (Alan G Isaac) Date: Sun, 26 Jan 2014 12:57:11 -0500 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: <20140126170240.GA4256@shinobi> References: <20140126170240.GA4256@shinobi> Message-ID: <52E54C77.80601@gmail.com> On 1/26/2014 12:02 PM, St?fan van der Walt wrote: > what would the output of > > ``group_by((key1, key2))`` I'd expect something named "groupby" to behave as below. Alan def groupby(seq, key): from collections import defaultdict groups = defaultdict(list) for item in seq: groups[key(item)].append(item) return groups print groupby(range(20), lambda x: x%2) From jtaylor.debian at googlemail.com Sun Jan 26 13:13:57 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sun, 26 Jan 2014 19:13:57 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: <20140126170640.GB4256@shinobi> References: <20140126170640.GB4256@shinobi> Message-ID: <52E55065.5020002@googlemail.com> On 26.01.2014 18:06, St?fan van der Walt wrote: > On Sun, 26 Jan 2014 16:40:44 +0200, Pauli Virtanen wrote: >> The Numpy Windows binaries distributed in the numpy project at >> sourceforge.net are compiled with ATLAS, which should count as an >> optimized BLAS. I don't recall what's the situation with OSX binaries, >> but I'd believe they're with Atlas too. > > Was a switch made away from Accelerate after this? > > http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html > if this issue disqualifies accelerate, it also disqualifies openblas as a default. openblas has the same issue, we stuck a big fat warning into the docs (site.cfg) for this now as people keep running into it. openblas is also a little dodgy concerning stability, in the past it crashed constantly on pretty standard problems, like dgemm on data > 64 mb etc. While the stability has improved with latest releases (>= 0.2.9) I think its still too early to consider openblas for a default. multithreaded ATLAS on the other hand seems works fine, at least I have not seen any similar issues with ATLAS in a very long time. Building optimized ATLAS is also a breeze on Debian based systems (see the README.Debian file) but I admit it is hard on any other platform. From hoogendoorn.eelco at gmail.com Sun Jan 26 14:01:01 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Sun, 26 Jan 2014 20:01:01 +0100 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: <52E54C77.80601@gmail.com> References: <20140126170240.GA4256@shinobi> <52E54C77.80601@gmail.com> Message-ID: Alan: The equivalent of that in my current draft would be group_by(keys, values), which is shorthand for group_by(keys).group(values); a optional values argument to the constructor of GroupBy is directly bound to return an iterable over the grouped values; but we often want to bind different value objects, with different operations, for the same set of keys, so it is convenient to be able to delay the binding of the values argument. Also, the third argument to group_by is an optional reduction function. 
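For a flavor of the machinery underneath (not the draft's actual code, just a minimal sketch built from existing numpy primitives): the expensive part is the indexing of the keys, essentially np.unique with return_inverse, after which per-group reductions are cheap:

import numpy as np

keys = np.array(list('abaabb'))
values = np.random.rand(6)

unique_keys, inverse = np.unique(keys, return_inverse=True)  # the expensive 'indexing' step
group_sums = np.bincount(inverse, weights=values)            # per-group reduction
group_means = group_sums / np.bincount(inverse)

The GroupBy object simply computes unique_keys/inverse once and reuses them for any number of such reductions.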
On Sun, Jan 26, 2014 at 6:57 PM, Alan G Isaac wrote: > On 1/26/2014 12:02 PM, St?fan van der Walt wrote: > > what would the output of > > > > ``group_by((key1, key2))`` > > > I'd expect something named "groupby" to behave as below. > Alan > > def groupby(seq, key): > from collections import defaultdict > groups = defaultdict(list) > for item in seq: > groups[key(item)].append(item) > return groups > > print groupby(range(20), lambda x: x%2) > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alan.isaac at gmail.com Sun Jan 26 14:44:44 2014 From: alan.isaac at gmail.com (Alan G Isaac) Date: Sun, 26 Jan 2014 14:44:44 -0500 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: References: <20140126170240.GA4256@shinobi> <52E54C77.80601@gmail.com> Message-ID: <52E565AC.1040503@gmail.com> My comment is just on the name. I'd expect something named `groupby` to behave essentially like Mathematica's `GatherBy` command. http://reference.wolfram.com/mathematica/ref/GatherBy.html I think you are after something more like Matlab's grpstats: http://www.mathworks.com/help/stats/grpstats.html Perhaps the implicit reference to SQL justifies the name... Sorry if this seems off topic, Alan Isaac From hoogendoorn.eelco at gmail.com Sun Jan 26 15:16:47 2014 From: hoogendoorn.eelco at gmail.com (Eelco Hoogendoorn) Date: Sun, 26 Jan 2014 21:16:47 +0100 Subject: [Numpy-discussion] Numpy Enhancement Proposal: group_by functionality In-Reply-To: <52E565AC.1040503@gmail.com> References: <20140126170240.GA4256@shinobi> <52E54C77.80601@gmail.com> <52E565AC.1040503@gmail.com> Message-ID: not off topic at all; there are several matters of naming that I am not at all settled on yet, and I don't think it is unimportant. indeed, those are closely related functions, and I wasn't aware of them yet, so that's some welcome additional perspective. The mathematica function differs in that the keys are always function of the values; as per your example as well. My proposed interface does not have that constraint, but that behavior is of course easily obtained by something like group_by(mapping(values), values). indeed grpstats also has a lot of overlap, though it does not have the same generality as my proposal. its interesting to wonder where one gets ones ideas as to how to call what. ive never worked with SQL much; I suppose I picked up this naming by working with LINQ. I rather like group_by; it is more suitable to the generality of the operations supported by the group_by object than something like grpstats. The majority of my applications for grouping have nothing whatsoever to do with statistics. On Sun, Jan 26, 2014 at 8:44 PM, Alan G Isaac wrote: > My comment is just on the name. > I'd expect something named `groupby` > to behave essentially like Mathematica's `GatherBy` command. > http://reference.wolfram.com/mathematica/ref/GatherBy.html > > I think you are after something more like Matlab's grpstats: > http://www.mathworks.com/help/stats/grpstats.html > > Perhaps the implicit reference to SQL justifies the name... 
> > Sorry if this seems off topic, > Alan Isaac > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Sun Jan 26 16:33:04 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Sun, 26 Jan 2014 21:33:04 +0000 (UTC) Subject: [Numpy-discussion] MKL and OpenBLAS References: <20140126170640.GB4256@shinobi> <52E55065.5020002@googlemail.com> Message-ID: <566884211412464364.536717sturla.molden-gmail.com@news.gmane.org> Julian Taylor wrote: > if this issue disqualifies accelerate, it also disqualifies openblas as > a default. openblas has the same issue, we stuck a big fat warning into > the docs (site.cfg) for this now as people keep running into it. What? Last time I checked, OpenBLAS (and GotoBLAS2) used OpenMP, not the GCD on Mac. Since OpenMP compiles to pthreads, it should not do this (pure POSIX). Accelerate uses the GCD yes, but it's hardly any better than ATLAS. If OpenBLAS now uses the GCD on Mac someone in China should be flogged. It is sad to hear about stability issues with OpenBLAS, it's predecessor GotoBLAS2 was rock solid. Sturla From jtaylor.debian at googlemail.com Sun Jan 26 17:13:17 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sun, 26 Jan 2014 23:13:17 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: <566884211412464364.536717sturla.molden-gmail.com@news.gmane.org> References: <20140126170640.GB4256@shinobi> <52E55065.5020002@googlemail.com> <566884211412464364.536717sturla.molden-gmail.com@news.gmane.org> Message-ID: <52E5887D.3070805@googlemail.com> On 26.01.2014 22:33, Sturla Molden wrote: > Julian Taylor wrote: > >> if this issue disqualifies accelerate, it also disqualifies openblas as >> a default. openblas has the same issue, we stuck a big fat warning into >> the docs (site.cfg) for this now as people keep running into it. > > What? Last time I checked, OpenBLAS (and GotoBLAS2) used OpenMP, not the > GCD on Mac. Since OpenMP compiles to pthreads, it should not do this (pure > POSIX). Accelerate uses the GCD yes, but it's hardly any better than ATLAS. > If OpenBLAS now uses the GCD on Mac someone in China should be flogged. the use of gnu openmp is probably be the problem, forking and gomp is only possible in very limited circumstances. see e.g. https://github.com/xianyi/OpenBLAS/issues/294 maybe it will work with clangs intel based openmp which should be coming soon. the current workaround is single threaded openblas, python3.4 forkserver or use atlas. From sturla.molden at gmail.com Sun Jan 26 18:01:52 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Sun, 26 Jan 2014 23:01:52 +0000 (UTC) Subject: [Numpy-discussion] MKL and OpenBLAS References: <20140126170640.GB4256@shinobi> <52E55065.5020002@googlemail.com> <566884211412464364.536717sturla.molden-gmail.com@news.gmane.org> <52E5887D.3070805@googlemail.com> Message-ID: <27092663412469230.633377sturla.molden-gmail.com@news.gmane.org> Julian Taylor wrote: > the use of gnu openmp is probably be the problem, forking and gomp is > only possible in very limited circumstances. > see e.g. https://github.com/xianyi/OpenBLAS/issues/294 > > maybe it will work with clangs intel based openmp which should be coming > soon. > the current workaround is single threaded openblas, python3.4 forkserver > or use atlas. 
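A sketch of what those two Python-level workarounds look like in practice; the OPENBLAS_NUM_THREADS variable applies to OpenBLAS builds, and the forkserver start method requires Python 3.4 on a Unix platform (both are assumptions about the reader's setup rather than NumPy requirements):

import os
# Workaround 1: force single-threaded OpenBLAS before NumPy is imported.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import multiprocessing as mp
import numpy as np

def work(seed):
    rng = np.random.RandomState(seed)
    a = rng.rand(200, 200)
    return np.dot(a, a.T).trace()

if __name__ == "__main__":
    # Workaround 2 (Python 3.4+, Unix only): the forkserver start method
    # avoids forking workers from a process that already runs OpenMP threads.
    if hasattr(mp, "set_start_method"):
        mp.set_start_method("forkserver")
    pool = mp.Pool(2)
    print(pool.map(work, range(4)))
    pool.close()
    pool.join()
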
Yes, it seems to be a GNU problem: http://bisqwit.iki.fi/story/howto/openmp/#OpenmpAndFork This Howto also claims Intel compilers is not affected. :) Sturla From dineshbvadhia at hotmail.com Mon Jan 27 04:18:20 2014 From: dineshbvadhia at hotmail.com (Dinesh Vadhia) Date: Mon, 27 Jan 2014 01:18:20 -0800 Subject: [Numpy-discussion] ANN: numexpr 2.3 (final) released In-Reply-To: <52E4BB94.80300@continuum.io> References: <52E4BB94.80300@continuum.io> Message-ID: Francesc: Does numexpr support scipy sparse matrices? From cmkleffner at gmail.com Mon Jan 27 06:01:46 2014 From: cmkleffner at gmail.com (Carl Kleffner) Date: Mon, 27 Jan 2014 12:01:46 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: Did you consider to check the experimental binaries on https://code.google.com/p/mingw-w64-static/ for Python-2.7? These binaries has been build with with a customized mingw-w64 toolchain. These builds are fully statically build and are link against the MSVC90 runtime libraries (gcc runtime is linked statically) and OpenBLAS. Carl 2014-01-26 Pauli Virtanen > 26.01.2014 14:44, Dinesh Vadhia kirjoitti: > > This conversation gets discussed often with Numpy developers but > > since the requirement for optimized Blas is pretty common these > > days, how about distributing Numpy with OpenBlas by default? People > > who don't want optimized BLAS or OpenBLAS can then edit the > > site.cfg file to add/remove. I can never remember if Numpy comes > > with Atlas by default but either way, if using MKL is not feasible > > because of its licensing issues then Numpy has to be re-compiled > > with OpenBLAS (for example). Why not make it easier for developers > > to use Numpy with an in-built optimized Blas. > > The Numpy Windows binaries distributed in the numpy project at > sourceforge.net are compiled with ATLAS, which should count as an > optimized BLAS. I don't recall what's the situation with OSX binaries, > but I'd believe they're with Atlas too. > > If you are suggesting bundling OpenBLAS with Numpy source releases --- > arguments against: > > OpenBLAS is big, and still rapidly moving. Moreover, bundling it with > Numpy does not really make it any easier to build. > > -- > Pauli Virtanen > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From francesc at continuum.io Mon Jan 27 08:06:14 2014 From: francesc at continuum.io (Francesc Alted) Date: Mon, 27 Jan 2014 14:06:14 +0100 Subject: [Numpy-discussion] ANN: numexpr 2.3 (final) released In-Reply-To: References: <52E4BB94.80300@continuum.io> Message-ID: <52E659C6.3050706@continuum.io> Not really. numexpr is mostly about element-wise operations in dense matrices. You should look to another package for that. Francesc On 1/27/14, 10:18 AM, Dinesh Vadhia wrote: > Francesc: Does numexpr support scipy sparse matrices? 
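In practice, numexpr's element-wise model can still be pressed into service for sparse data by operating on the matrix's .data array; this is a sketch (assuming numexpr and scipy are installed), not an official numexpr feature, and it is only equivalent to the dense operation for functions that map zero to zero:

import numpy as np
import numexpr as ne
import scipy.sparse as sp

# numexpr's territory: fast element-wise expressions on dense arrays.
a = np.random.rand(10000)
b = np.random.rand(10000)
dense = ne.evaluate("2*a + 3*b")

# Sparse workaround: apply the expression to the stored nonzeros only.
# log1p maps 0 to 0, so this matches the dense result; 2*x + 3 would not.
m = sp.rand(1000, 1000, density=0.01, format="csr")
m.data = ne.evaluate("log1p(d)", local_dict={"d": m.data})
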
> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted From ndbecker2 at gmail.com Mon Jan 27 08:28:23 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Mon, 27 Jan 2014 08:28:23 -0500 Subject: [Numpy-discussion] another interesting high performance vector lib (yeppp) Message-ID: http://www.yeppp.info/ From cmkleffner at gmail.com Mon Jan 27 08:57:44 2014 From: cmkleffner at gmail.com (Carl Kleffner) Date: Mon, 27 Jan 2014 14:57:44 +0100 Subject: [Numpy-discussion] another interesting high performance vector lib (yeppp) In-Reply-To: References: Message-ID: a similar SIMD based library for transcendental function ist SLEEF http://shibatch.sourceforge.net/ . An inclompete wrapper can be found here: https://github.com/nikolaynag/avxmath I suppose that Intels VML has a better coverage over YEPPP or SLEEF. Carl 2014-01-27 Neal Becker > http://www.yeppp.info/ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Mon Jan 27 09:10:45 2014 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Mon, 27 Jan 2014 15:10:45 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: <20140126170640.GB4256@shinobi> References: <20140126170640.GB4256@shinobi> Message-ID: On Sun, Jan 26, 2014 at 6:06 PM, St?fan van der Walt wrote: > On Sun, 26 Jan 2014 16:40:44 +0200, Pauli Virtanen wrote: > > The Numpy Windows binaries distributed in the numpy project at > > sourceforge.net are compiled with ATLAS, which should count as an > > optimized BLAS. I don't recall what's the situation with OSX binaries, > > but I'd believe they're with Atlas too. > > Was a switch made away from Accelerate after this? > http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html > No, nothing changed. Still using Accelerate for all official binaries. Ralf > > St?fan > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Mon Jan 27 15:04:30 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Mon, 27 Jan 2014 21:04:30 +0100 Subject: [Numpy-discussion] windows and C99 math Message-ID: <52E6BBCE.8070304@googlemail.com> hi, numpys no-C99 fallback keeps turning up issues in corner cases, e.g. hypot https://github.com/numpy/numpy/issues/2385 log1p https://github.com/numpy/numpy/issues/4225 these only seem to happen on windows, on linux and mac it seems to use the C99 math library just fine. Are our binary builds for windows not correct or does windows just not support C99 math? Hopefully it is the former. Any insight is appreciated (and patches to fix the build even more!) Cheers, Julian From nouiz at nouiz.org Mon Jan 27 15:23:46 2014 From: nouiz at nouiz.org (=?ISO-8859-1?Q?Fr=E9d=E9ric_Bastien?=) Date: Mon, 27 Jan 2014 15:23:46 -0500 Subject: [Numpy-discussion] windows and C99 math In-Reply-To: <52E6BBCE.8070304@googlemail.com> References: <52E6BBCE.8070304@googlemail.com> Message-ID: Just a guess as I don't make those binaries, but I think they are done with Visual Studio and it only support C89... 
We need to back port some of our c code for windows for GPU as nvcc use VS and it don't support C99. Fred On Mon, Jan 27, 2014 at 3:04 PM, Julian Taylor wrote: > hi, > numpys no-C99 fallback keeps turning up issues in corner cases, e.g. > hypot https://github.com/numpy/numpy/issues/2385 > log1p https://github.com/numpy/numpy/issues/4225 > > these only seem to happen on windows, on linux and mac it seems to use > the C99 math library just fine. > > Are our binary builds for windows not correct or does windows just not > support C99 math? > > Hopefully it is the former. Any insight is appreciated (and patches to > fix the build even more!) > > Cheers, > Julian > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion From charles at crunch.io Mon Jan 27 15:43:30 2014 From: charles at crunch.io (Charles G. Waldman) Date: Mon, 27 Jan 2014 14:43:30 -0600 Subject: [Numpy-discussion] bug in comparing object arrays to None (?) Message-ID: Hi Numpy folks. I just noticed that comparing an array of type 'object' to None does not behave as I expected. Is this a feature or a bug? (I can take a stab at fixing it if it's a bug, as I believe it is). >>> np.version.full_version '1.8.0' >>> a = np.array(['Frank', None, 'Nancy']) >>> a array(['Frank', None, 'Nancy'], dtype=object) >>> a == 'Frank' array([ True, False, False], dtype=bool) # Return value is an array >>> a == None False # Return value is scalar (BUG?) From warren.weckesser at gmail.com Mon Jan 27 15:51:44 2014 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Mon, 27 Jan 2014 15:51:44 -0500 Subject: [Numpy-discussion] bug in comparing object arrays to None (?) In-Reply-To: References: Message-ID: On Mon, Jan 27, 2014 at 3:43 PM, Charles G. Waldman wrote: > Hi Numpy folks. > > I just noticed that comparing an array of type 'object' to None does > not behave as I expected. Is this a feature or a bug? (I can take a > stab at fixing it if it's a bug, as I believe it is). > > >>> np.version.full_version > '1.8.0' > > >>> a = np.array(['Frank', None, 'Nancy']) > > >>> a > array(['Frank', None, 'Nancy'], dtype=object) > > >>> a == 'Frank' > array([ True, False, False], dtype=bool) > # Return value is an array > > >>> a == None > False > # Return value is scalar (BUG?) > Looks like a fix is in progress: https://github.com/numpy/numpy/pull/3514 Warren _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Mon Jan 27 18:04:13 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Mon, 27 Jan 2014 23:04:13 +0000 (UTC) Subject: [Numpy-discussion] windows and C99 math References: <52E6BBCE.8070304@googlemail.com> Message-ID: <2030236615412555553.547415sturla.molden-gmail.com@news.gmane.org> Julian Taylor wrote: > Are our binary builds for windows not correct or does windows just not > support C99 math? Microsoft's C compiler does not support C99. It is not an OS issue. Use gcc, clang or Intel icc instead, and C99 is supported. 
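For readers wondering why the C99 functions (and NumPy's fallbacks for them) matter in the first place, a quick numerical illustration; this is standard double-precision behaviour, not specific to any compiler:

import numpy as np

x = 1e-12
print(np.log1p(x))       # ~1e-12, accurate near zero
print(np.log(1.0 + x))   # loses most of the significant digits

a = b = np.float64(1e200)
print(np.hypot(a, b))          # ~1.414e+200, no intermediate overflow
print(np.sqrt(a**2 + b**2))    # a**2 overflows, result is inf
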
Sturla From cournape at gmail.com Mon Jan 27 18:12:55 2014 From: cournape at gmail.com (David Cournapeau) Date: Mon, 27 Jan 2014 23:12:55 +0000 Subject: [Numpy-discussion] windows and C99 math In-Reply-To: <52E6BBCE.8070304@googlemail.com> References: <52E6BBCE.8070304@googlemail.com> Message-ID: On Mon, Jan 27, 2014 at 8:04 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > hi, > numpys no-C99 fallback keeps turning up issues in corner cases, e.g. > hypot https://github.com/numpy/numpy/issues/2385 > log1p https://github.com/numpy/numpy/issues/4225 > > these only seem to happen on windows, on linux and mac it seems to use > the C99 math library just fine. > > Are our binary builds for windows not correct or does windows just not > support C99 math? > Ms compilers have not supported much of C > C89. Up to recently, they even said publicly something close to "we don't care about C, C is legacy and you should use C++": http://herbsutter.com/2012/05/03/reader-qa-what-about-vc-and-c99/ But it looks like they are finally changing their stance: http://blogs.msdn.com/b/vcblog/archive/2013/07/19/c99-library-support-in-visual-studio-2013.aspx Of course, it will be a while before we can rely on this, but hey, it only took them 14 years ! David > > Hopefully it is the former. Any insight is appreciated (and patches to > fix the build even more!) > > Cheers, > Julian > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From JMcGlinchy at esri.com Tue Jan 28 13:44:51 2014 From: JMcGlinchy at esri.com (Joseph McGlinchy) Date: Tue, 28 Jan 2014 18:44:51 +0000 Subject: [Numpy-discussion] scipy image processing memory leak in python 2.7? Message-ID: <6D6DE7712442DD4CAEFE7FB9D86210D222604449@RED-INF-EXMB-P1.esri.com> Hi numpy list! I am trying to do some image processing on a number of images, 72 to be specific. I am seeing the python memory usage continually increase. I will say this is being done also using arcpy so that COULD be an issue, but want to rule out scipy first since I am only using arcpy to grab a set of polygons and convert them to a binary image, and to save images to disk. I pass the raster to my numpy/scipy processing. The code is organized as: 1) Main loop iterating over features 2) Function which does image processing using scipy 3) Class which implements some of the image processing features I placed 2 sleep calls in the code.. one at the end of the image processing function and one at the end of each iteration in the main loop. I am seeing a 20MB memory increase each iteration while looking at my processes in Task Manager: Iteration 1: 204,312K and 209,908K, respectively Iteration 2: 233,728K and 230,192K, respectively Iteration 3: 255,676K and 250,600K, respectively Any ideas? Thanks so much! 
Joe 1) Main loop #for each feature class, convert to raster by "Value" and reclass to 1 and 0 temp_raster = os.path.join(scratchWorkspace,"temp_1.tif") temp_raster_elev = os.path.join(scratchWorkspace,"temp_elev.tif") temp_reclass = os.path.join(scratchWorkspace,"temp_reclass.tif") remap = arcpy.sa.RemapValue([[1,1], ["NoData",0]]) for i,fc in enumerate(potWaterFeatNames): #try to delete the temp rasters in case they exist try: #delete temp rasters arcpy.Delete_management(temp_raster) arcpy.Delete_management(temp_raster_elev) arcpy.Delete_management(temp_reclass) except e: arcpy.AddMessage(e.message()) #clear in memory workspace" arcpy.Delete_management("in_memory/temp_table") #print "on feature {} of {}".format(i+1, len(potWaterFeatNames)) arcpy.AddMessage("on feature {} of {}".format(i+1, len(potWaterFeatNames))) arcpy.AddMessage("fc = {}".format(fc)) arcpy.PolygonToRaster_conversion(fc, "Value", temp_raster, "", "", 1) outReclass = arcpy.sa.Reclassify(temp_raster,"Value",remap) outReclass.save(temp_reclass) del outReclass #call function to process connected components try: proc_im = processBinaryImage(temp_reclass, scratchWorkspace, sm_pixels) except Exception as e: print "FAILED! scipy image processing" arcpy.AddMessage(e.message()) #convert processed raster to polygons try: out_fc = os.path.join(outWorkspace,fc + "_cleaned") arcpy.RasterToPolygon_conversion(proc_im, out_fc) arcpy.Delete_management(proc_im) except Exception as e: print "FAILED! converting cleaned binary raster to polygon" arcpy.AddMessage(e.message()) #delete zero value polyons, gridcode = 0 try: uc = arcpy.UpdateCursor(out_fc, "gridcode=0",arcpy.Describe(out_fc).spatialReference) #loop through 0 rows and delete for row in uc: uc.deleteRow(row) del uc except Exception as e: print "FAILED! deleting rows from table" arcpy.AddMessage(e.message()) #check that number of polygons with gridcode = 1 is greater than 0 count = arcpy.GetCount_management(out_fc) #add elevation field back in if (count > 0): arcpy.PolygonToRaster_conversion(fc, "z_val", temp_raster_elev, "", "", 1) arcpy.sa.ZonalStatisticsAsTable(out_fc, "Id", temp_raster_elev, "in_memory/temp_table","#","MEAN") arcpy.JoinField_management(out_fc, "Id", "in_memory/temp_table", "Id", ["MEAN"]) else: arcpy.Delete_management(out_fc) #delete temp rasters arcpy.Delete_management(temp_raster) arcpy.Delete_management(temp_raster_elev) arcpy.Delete_management(temp_reclass) #python garbage collection #collected = gc.collect() #print "Garbage collector: collected %d objects." 
% (collected) print "sleeping for 10 seconds" time.sleep(10) 2) Image processing function def processBinaryImage(imageName, save_dir, sm_pixels): fname = os.path.basename(imageName) imRaster = arcpy.Raster(imageName) #Grab AGS info from image #use describe module to grab info descData = arcpy.Describe(imRaster) cellSize = descData.meanCellHeight extent = descData.Extent spatialReference = descData.spatialReference pnt = arcpy.Point(extent.XMin, extent.YMin) del imRaster converter = ConvertRasterNumpy(); imArray = converter.rasterToNumpyArray(imageName) imMin = np.min(imArray) imMax = np.max(imArray) print imMin, imMax arcpy.AddMessage("image min: " + str(imMin)) arcpy.AddMessage("image max: " + str(imMax)); #other flags show_image = False #verbose save_flag = False #under verbose gray = False #threshold but keep gray levels make_binary = False #flag if grayscale is true, to binarize final image save_AGS_rasters = True #create filter object filter = thresholdMedianFilter() lowValue = 0 tImage = filter.thresholdImage(imArray, lowValue, gray) #median filter image filtImage1 = filter.medianFiltImage(tImage,1) #create structuring element for morphological operations sElem = np.array([[0., 1., 0.], [1., 1., 1.], [0., 1., 0.]]) #open filtered image gray = False mImage = filter.mOpenImage(filtImage1, sElem, gray) #set list to store info change num_its = 100 the_change = np.zeros(num_its) for it in range(100): prev = mImage filtImage = filter.medianFiltImage(mImage,1) mImage = filter.mOpenImage(filtImage, sElem, gray) #calculate difference (m_info, z_info) = filter.calcInfoChange(prev, mImage) the_change[it] = z_info del filtImage del prev #if the change is less than 5% of the initial change, exit cur_perc = the_change[it]/the_change[0]*100 if cur_perc < 5.0: print "exiting filter on iteration " + str(it) print "change is less than 5% (this iteration: " + str(cur_perc) + ")" break ############################################################################ # now we have a binary mask. Let's find and label the connected components # ############################################################################ #clear some space del tImage del filtImage1 del m_info label_im_init, nb_labels = ndimage.label(mImage) #Compute size, mean_value, etc. of each region: label_im = label_im_init.copy() del label_im_init sizes = ndimage.sum(mImage, label_im, range(nb_labels + 1)) mean_vals = ndimage.sum(imArray, label_im, range(1, nb_labels + 1)) #clean up small components mask_size = sizes < sm_pixels remove_pixel = mask_size[label_im] label_im[remove_pixel] = 0; labels = np.unique(label_im) label_im = np.searchsorted(labels, label_im) #make label image to a binary image, and convert to arcgis raster label_im[label_im > 0] = 1 label_im = np.array(label_im, dtype = 'float32') print label_im.dtype saveit = False if ~saveit: outRaster = save_dir + "\\" + fname[:-4] + "_CC_" + str(sm_pixels) + ".tif" temp = arcpy.NumPyArrayToRaster(label_im,pnt,cellSize,cellSize) arcpy.DefineProjection_management(temp, spatialReference) arcpy.CopyRaster_management(temp, outRaster, "DEFAULTS","0","","","","8_BIT_UNSIGNED") #clear more space del mImage del nb_labels del sizes del mean_vals del mask_size del remove_pixel del label_im del labels del temp del the_change del sElem del filter del imArray print 'sleeping' time.sleep(20) return outRaster 3) Image processing class class thresholdMedianFilter(): def thresholdImage(self, imArray, thresh, binary = True): """ threshold the image. 
values equal or below thresh are 0, above are 1""" tImage = imArray tImage[tImage <= thresh] = 0 if binary: tImage[tImage > thresh] = 1 return tImage def medianFiltImage(self,imArray,n=1): """median filter the image n amount of times. a single time is the default""" for n in range(n): prev = imArray imArray = signal.medfilt2d(prev) del prev return imArray def mOpenImage(self,imArray, sElem, gray = False): """ morphological opening """ #Mnp = np.array(morph.binary_erosion(imArray, sElem)) #Mnp = np.array(morph.binary_dilation(Mnp, sElem)) if gray: imArray1 = np.array(morph.grey_dilation(imArray, structure = sElem)) imArray2 = np.array(morph.grey_erosion(imArray1, structure = sElem), dtype = 'float32') del imArray1 return imArray2 else: Mnp1 = np.array(morph.binary_dilation(imArray, sElem)) Mnp2 = np.array(morph.binary_erosion(Mnp1, sElem), dtype = 'float32') del Mnp1 return Mnp2 def calcInfoChange(self,imArray1, imArray2): """calculate entropy of an image""" diff = imArray1 - imArray2 m_norm = sum(abs(diff)) #Manhattan Norm z_norm = norm(diff.ravel(), 0) #Zero norm del diff return (m_norm, z_norm) Joe McGlinchy | Imagery Scientist Database Services - 3D and Imagery Team ESRI || 380 New York St. || Redlands, CA 92373 || USA T 909-793-2853, ext. 4783 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Tue Jan 28 13:57:53 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Tue, 28 Jan 2014 19:57:53 +0100 Subject: [Numpy-discussion] scipy image processing memory leak in python 2.7? In-Reply-To: <6D6DE7712442DD4CAEFE7FB9D86210D222604449@RED-INF-EXMB-P1.esri.com> References: <6D6DE7712442DD4CAEFE7FB9D86210D222604449@RED-INF-EXMB-P1.esri.com> Message-ID: <52E7FDB1.8010109@googlemail.com> On 28.01.2014 19:44, Joseph McGlinchy wrote: > Hi numpy list! > > > > I am trying to do some image processing on a number of images, 72 to be > specific. I am seeing the python memory usage continually increase. which version of scipy are you using? there is a memory leak in ndimage.label in version 0.13 which will be fixed in 0.13.3 due soon. see https://github.com/scipy/scipy/issues/3148 From JMcGlinchy at esri.com Tue Jan 28 14:08:17 2014 From: JMcGlinchy at esri.com (Joseph McGlinchy) Date: Tue, 28 Jan 2014 19:08:17 +0000 Subject: [Numpy-discussion] scipy image processing memory leak in python 2.7? In-Reply-To: <52E7FDB1.8010109@googlemail.com> References: <6D6DE7712442DD4CAEFE7FB9D86210D222604449@RED-INF-EXMB-P1.esri.com> <52E7FDB1.8010109@googlemail.com> Message-ID: <6D6DE7712442DD4CAEFE7FB9D86210D2226046E7@RED-INF-EXMB-P1.esri.com> >>> scipy.version.version '0.11.0b1' >>> numpy.version.version '1.6.1' Is that, or any other, memory leaks present in these versions? -----Original Message----- From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Julian Taylor Sent: Tuesday, January 28, 2014 10:58 AM To: numpy-discussion at scipy.org Subject: Re: [Numpy-discussion] scipy image processing memory leak in python 2.7? On 28.01.2014 19:44, Joseph McGlinchy wrote: > Hi numpy list! > > > > I am trying to do some image processing on a number of images, 72 to > be specific. I am seeing the python memory usage continually increase. which version of scipy are you using? there is a memory leak in ndimage.label in version 0.13 which will be fixed in 0.13.3 due soon. 
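A minimal check of whether ndimage.label by itself leaks on a given SciPy version (the issue link follows below): memory use should stay flat over the loop. psutil is assumed here only for convenience; watching the process in Task Manager, as above, works equally well.

import numpy as np
from scipy import ndimage
import psutil  # optional, only used to print the process memory

proc = psutil.Process()
img = np.random.rand(512, 512) > 0.5

for i in range(200):
    labels, n = ndimage.label(img)
    del labels
    if i % 50 == 0:
        print("iteration %d: RSS = %.1f MB"
              % (i, proc.memory_info().rss / 1e6))
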
see https://github.com/scipy/scipy/issues/3148 _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion From ralf.gommers at gmail.com Wed Jan 29 09:16:01 2014 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Wed, 29 Jan 2014 15:16:01 +0100 Subject: [Numpy-discussion] scipy image processing memory leak in python 2.7? In-Reply-To: <6D6DE7712442DD4CAEFE7FB9D86210D2226046E7@RED-INF-EXMB-P1.esri.com> References: <6D6DE7712442DD4CAEFE7FB9D86210D222604449@RED-INF-EXMB-P1.esri.com> <52E7FDB1.8010109@googlemail.com> <6D6DE7712442DD4CAEFE7FB9D86210D2226046E7@RED-INF-EXMB-P1.esri.com> Message-ID: On Tue, Jan 28, 2014 at 8:08 PM, Joseph McGlinchy wrote: > >>> scipy.version.version > '0.11.0b1' > > >>> numpy.version.version > '1.6.1' > > Is that, or any other, memory leaks present in these versions? > That memory leak isn't, it's only present in 0.13.0-0.13.2. I'm not aware of any other leaks that were fixed since 0.11.0 either, but still it would be worth checking if you see the same issue with current scipy master. Ralf > > > -----Original Message----- > From: numpy-discussion-bounces at scipy.org [mailto: > numpy-discussion-bounces at scipy.org] On Behalf Of Julian Taylor > Sent: Tuesday, January 28, 2014 10:58 AM > To: numpy-discussion at scipy.org > Subject: Re: [Numpy-discussion] scipy image processing memory leak in > python 2.7? > > On 28.01.2014 19:44, Joseph McGlinchy wrote: > > Hi numpy list! > > > > > > > > I am trying to do some image processing on a number of images, 72 to > > be specific. I am seeing the python memory usage continually increase. > > which version of scipy are you using? > there is a memory leak in ndimage.label in version 0.13 which will be > fixed in 0.13.3 due soon. > see https://github.com/scipy/scipy/issues/3148 > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From JMcGlinchy at esri.com Wed Jan 29 13:19:16 2014 From: JMcGlinchy at esri.com (Joseph McGlinchy) Date: Wed, 29 Jan 2014 18:19:16 +0000 Subject: [Numpy-discussion] scipy image processing memory leak in python 2.7? In-Reply-To: References: <6D6DE7712442DD4CAEFE7FB9D86210D222604449@RED-INF-EXMB-P1.esri.com> <52E7FDB1.8010109@googlemail.com> <6D6DE7712442DD4CAEFE7FB9D86210D2226046E7@RED-INF-EXMB-P1.esri.com> Message-ID: <6D6DE7712442DD4CAEFE7FB9D86210D222604A2F@RED-INF-EXMB-P1.esri.com> I have updated to scipy 0.13.2 (specifically scipy-0.13.2-win32-superpack-python2.7.exe (59.1 MB)) at http://sourceforge.net/projects/scipy/files/ I am still seeing the memory leak, and it appears to be a little larger (25MB) each iteration. 
Here are the numbers I am seeing on my python process: Iteration 1, sleep1: 207,716k Iteration 1, sleep2: 212,112k Iteration 2, sleep1: 236,488k Iteration 2, sleep2: 237,160k Iteration 3, sleep1: 261,168k Iteration 3, sleep2: 261,264k Iteration 4, sleep1: 285,044k Iteration 4, sleep2: 285,724k -Joe From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Ralf Gommers Sent: Wednesday, January 29, 2014 6:16 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] scipy image processing memory leak in python 2.7? On Tue, Jan 28, 2014 at 8:08 PM, Joseph McGlinchy > wrote: >>> scipy.version.version '0.11.0b1' >>> numpy.version.version '1.6.1' Is that, or any other, memory leaks present in these versions? That memory leak isn't, it's only present in 0.13.0-0.13.2. I'm not aware of any other leaks that were fixed since 0.11.0 either, but still it would be worth checking if you see the same issue with current scipy master. Ralf -----Original Message----- From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Julian Taylor Sent: Tuesday, January 28, 2014 10:58 AM To: numpy-discussion at scipy.org Subject: Re: [Numpy-discussion] scipy image processing memory leak in python 2.7? On 28.01.2014 19:44, Joseph McGlinchy wrote: > Hi numpy list! > > > > I am trying to do some image processing on a number of images, 72 to > be specific. I am seeing the python memory usage continually increase. which version of scipy are you using? there is a memory leak in ndimage.label in version 0.13 which will be fixed in 0.13.3 due soon. see https://github.com/scipy/scipy/issues/3148 _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.root at ou.edu Wed Jan 29 14:10:02 2014 From: ben.root at ou.edu (Benjamin Root) Date: Wed, 29 Jan 2014 14:10:02 -0500 Subject: [Numpy-discussion] Memory leak in numpy? In-Reply-To: References: Message-ID: Hmmm, I see no reason why that would eat up memory. I just tried it out on my own system (numpy 1.6.1, CentOS 6, python 2.7.1), and had no issues, Memory usage stayed flat for the 10 seconds it took to go through the loop. Note, I am not using ATLAS or BLAS, so maybe the issue lies there? (i don't know if numpy defers the dot-product over to ATLAS or BLAS if they are available) -------------- next part -------------- An HTML attachment was scrubbed... URL: From JMcGlinchy at esri.com Wed Jan 29 14:16:44 2014 From: JMcGlinchy at esri.com (Joseph McGlinchy) Date: Wed, 29 Jan 2014 19:16:44 +0000 Subject: [Numpy-discussion] Memory leak in numpy? In-Reply-To: References: Message-ID: <6D6DE7712442DD4CAEFE7FB9D86210D222604A6F@RED-INF-EXMB-P1.esri.com> Perhaps it is an ESRI/Arcpy issue then. I don't see anything that could be doing that, though, as it is very minimal. From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Benjamin Root Sent: Wednesday, January 29, 2014 11:10 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] Memory leak in numpy? Hmmm, I see no reason why that would eat up memory. 
I just tried it out on my own system (numpy 1.6.1, CentOS 6, python 2.7.1), and had no issues, Memory usage stayed flat for the 10 seconds it took to go through the loop. Note, I am not using ATLAS or BLAS, so maybe the issue lies there? (i don't know if numpy defers the dot-product over to ATLAS or BLAS if they are available) -------------- next part -------------- An HTML attachment was scrubbed... URL: From JMcGlinchy at esri.com Wed Jan 29 14:39:44 2014 From: JMcGlinchy at esri.com (Joseph McGlinchy) Date: Wed, 29 Jan 2014 19:39:44 +0000 Subject: [Numpy-discussion] Memory leak in numpy? In-Reply-To: <6D6DE7712442DD4CAEFE7FB9D86210D222604A6F@RED-INF-EXMB-P1.esri.com> References: <6D6DE7712442DD4CAEFE7FB9D86210D222604A6F@RED-INF-EXMB-P1.esri.com> Message-ID: <6D6DE7712442DD4CAEFE7FB9D86210D222604A8B@RED-INF-EXMB-P1.esri.com> Upon further investigation, I do believe it is within the scipy code where there is a leak. I commented out my call to processBinaryImage(), which is all scipy code calls, and my memory usage remains flat with approximately a 1MB variation. Any ideas? Right now I am getting around it by checking to see how far I got through my dataset, but I have to restart the program after each memory crash. From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Joseph McGlinchy Sent: Wednesday, January 29, 2014 11:17 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] Memory leak in numpy? Perhaps it is an ESRI/Arcpy issue then. I don't see anything that could be doing that, though, as it is very minimal. From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Benjamin Root Sent: Wednesday, January 29, 2014 11:10 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] Memory leak in numpy? Hmmm, I see no reason why that would eat up memory. I just tried it out on my own system (numpy 1.6.1, CentOS 6, python 2.7.1), and had no issues, Memory usage stayed flat for the 10 seconds it took to go through the loop. Note, I am not using ATLAS or BLAS, so maybe the issue lies there? (i don't know if numpy defers the dot-product over to ATLAS or BLAS if they are available) -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Wed Jan 29 14:44:19 2014 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 29 Jan 2014 19:44:19 +0000 Subject: [Numpy-discussion] Memory leak in numpy? In-Reply-To: <6D6DE7712442DD4CAEFE7FB9D86210D222604A8B@RED-INF-EXMB-P1.esri.com> References: <6D6DE7712442DD4CAEFE7FB9D86210D222604A6F@RED-INF-EXMB-P1.esri.com> <6D6DE7712442DD4CAEFE7FB9D86210D222604A8B@RED-INF-EXMB-P1.esri.com> Message-ID: On Wed, Jan 29, 2014 at 7:39 PM, Joseph McGlinchy wrote: > Upon further investigation, I do believe it is within the scipy code where > there is a leak. I commented out my call to processBinaryImage(), which is > all scipy code calls, and my memory usage remains flat with approximately a > 1MB variation. Any ideas? I'd suggest continuing along this line, and keep chopping things out until you have a minimal program that still shows the problem -- that'll probably make it much clearer where the problem is actually coming from... -n From jtaylor.debian at googlemail.com Wed Jan 29 14:52:32 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 29 Jan 2014 20:52:32 +0100 Subject: [Numpy-discussion] Memory leak in numpy? 
In-Reply-To: References: <6D6DE7712442DD4CAEFE7FB9D86210D222604A6F@RED-INF-EXMB-P1.esri.com> <6D6DE7712442DD4CAEFE7FB9D86210D222604A8B@RED-INF-EXMB-P1.esri.com> Message-ID: <52E95C00.2070406@googlemail.com> On 29.01.2014 20:44, Nathaniel Smith wrote: > On Wed, Jan 29, 2014 at 7:39 PM, Joseph McGlinchy wrote: >> Upon further investigation, I do believe it is within the scipy code where >> there is a leak. I commented out my call to processBinaryImage(), which is >> all scipy code calls, and my memory usage remains flat with approximately a >> 1MB variation. Any ideas? > > I'd suggest continuing along this line, and keep chopping things out > until you have a minimal program that still shows the problem -- > that'll probably make it much clearer where the problem is actually > coming from... > > -n depending on how long the program runs you can try running it under massif the valgrind memory usage proftool, that should give you a good clue where the source is. From JMcGlinchy at esri.com Wed Jan 29 17:13:01 2014 From: JMcGlinchy at esri.com (Joseph McGlinchy) Date: Wed, 29 Jan 2014 22:13:01 +0000 Subject: [Numpy-discussion] Memory leak in numpy? In-Reply-To: <52E95C00.2070406@googlemail.com> References: <6D6DE7712442DD4CAEFE7FB9D86210D222604A6F@RED-INF-EXMB-P1.esri.com> <6D6DE7712442DD4CAEFE7FB9D86210D222604A8B@RED-INF-EXMB-P1.esri.com> <52E95C00.2070406@googlemail.com> Message-ID: <6D6DE7712442DD4CAEFE7FB9D86210D222604AEF@RED-INF-EXMB-P1.esri.com> Unfortunately I do not have Linux or much time to invest in researching and learning an alternative to Valgrind :/ My current workaround, which works very well, is to move my scipy part of the script to its own script and then use os.system() to call it with the appropriate arguments. Thanks everyone for the replies! Is there a proper way to close the thread? -Joe -----Original Message----- From: numpy-discussion-bounces at scipy.org [mailto:numpy-discussion-bounces at scipy.org] On Behalf Of Julian Taylor Sent: Wednesday, January 29, 2014 11:53 AM To: Discussion of Numerical Python Subject: Re: [Numpy-discussion] Memory leak in numpy? On 29.01.2014 20:44, Nathaniel Smith wrote: > On Wed, Jan 29, 2014 at 7:39 PM, Joseph McGlinchy wrote: >> Upon further investigation, I do believe it is within the scipy code >> where there is a leak. I commented out my call to >> processBinaryImage(), which is all scipy code calls, and my memory >> usage remains flat with approximately a 1MB variation. Any ideas? > > I'd suggest continuing along this line, and keep chopping things out > until you have a minimal program that still shows the problem -- > that'll probably make it much clearer where the problem is actually > coming from... > > -n depending on how long the program runs you can try running it under massif the valgrind memory usage proftool, that should give you a good clue where the source is. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion From jaime.frio at gmail.com Thu Jan 30 02:38:42 2014 From: jaime.frio at gmail.com (=?ISO-8859-1?Q?Jaime_Fern=E1ndez_del_R=EDo?=) Date: Wed, 29 Jan 2014 23:38:42 -0800 Subject: [Numpy-discussion] ENH: Type specific binary search functions for `searchsorted` Message-ID: Hi, I have just added a new PR: https://github.com/numpy/numpy/pull/4244 >From the commit message: This PR replaces the generic binary search functions used by `searchsorted` with type specific ones for numeric types. 
This results in a speed-up of calls to `searchsorted` which is highly dependent on the size of the 'haystack' and the 'needle', with typical values for large enough needles in the 1.5x - 3.0x for direct searches (i.e. without a sorter argument) and 1.2x - 2.0x for indirect searches. A summary benchmark on float and int arrays can be found here. Furthermore, the type specific binary search functions can take strided inputs for all their arguments, which is a step in the right direction to eventually add an axis argument to`searchsorted`. Any comments and reviews are more than welcome! Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Thu Jan 30 03:11:44 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Thu, 30 Jan 2014 09:11:44 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: On 27/01/14 12:01, Carl Kleffner wrote: > Did you consider to check the experimental binaries on > https://code.google.com/p/mingw-w64-static/ for Python-2.7? These > binaries has been build with with a customized mingw-w64 toolchain. > These builds are fully statically build and are link against the MSVC90 > runtime libraries (gcc runtime is linked statically) and OpenBLAS. > > Carl Building OpenBLAS and LAPACK is very easy. I used TDM-GCC for Win64. It's just two makefile (not even a configure script). OpenBLAS and LAPACK are probably the easiest libraries to build there is. The main problem for using OpenBLAS with NumPy and SciPy on Windows is that Python 2.7 from www.python.org does not ship with libpython27.a for 64-bit Python, so we need to maintain our own. Also, GNU compilers are required to build OpenBLAS. This means we have to build our own libgfortran as well. The binary is incompatible with the MSVC runtime we use. I.e. not impossible, but painful. http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063740.html Sturla From cmkleffner at gmail.com Thu Jan 30 06:01:11 2014 From: cmkleffner at gmail.com (Carl Kleffner) Date: Thu, 30 Jan 2014 12:01:11 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: I agree, building OpenBLAS with mingw-w64 is a snap. The problem is choosing and adapting a mingw based gcc-toolchain and patching the numpy sources according to this toolchain. For the last years I was a happy user of the mingw.org based toolchain. After searching for a 64-bit alternative I stumbled upon mingw-w64 and its derivatives. I tried out several mingw-w64 based toolchains, i.e. TDM, equation.com and more. All mingw-w64 derivatives have there pros and cons. You may know, that you have to choose not only for bitness (32 vs 64 bit) and gcc version, but also for exception handling (sjlj, dwarf, seh) and the way threading is supported (win32 vs. posix threads). Not all of these derivatives describe what they use in a clearly manner. And the TDM toolchain i.e. has introduced some API incompatibilities to standard gcc-toolchains due to its own patches. A serious problem is gcc linking to runtimes other than msvcrt.dll. Mingw-w64 HAS import libraries fort msvcr80, msvcr90, msvcr100, msvcr110. However correct linkage to say to msvcr90 is more than just adding -lmsvcr90 to the linker command. You have to create a spec file for gcc and adapt it to your need. 
It is also very important (especially for msvcr90) to link manifest files to the binaries you create. This has to do with the way Microsoft searches for DLLs. "Akruis" (Anselm Kruis, science + computing AG) did the job to iron out these problems concerning mingw-w64 and python. Unfortunately his blog disappears for some time now. The maintainers of the mingw-w64 toolchains DO NOT focus on the problem with alternative runtime linking. A related problem is that symbols are used by OpenMP and winpthreads you can resolve in msvcrt.dll, but not in msvcr90.dll, so "_ftime"has to be exchanged with "ftime64" if you want to use OpenMP or winpthreads. In the end my solution was to build my own toolchain. This is time consuming but simple with the help of the set of scripts you can find here: https://github.com/niXman/mingw-builds/tree/develop With this set of scripts and msys2 http://sourceforge.net/projects/msys2/and my own "_ftime" patch I build a 'statically' mingw-w64 toolchain. Let me say a word about statically build: GCC can be build statically. This means, that all of the C, C++, Gfortran runtime is statically linked to every binary. There is not much bloat as you might expect when the binaries are stripped. And yes, it is necessary to build an import lib for python. This import lib is specific to the toolchain you are going to use. My idea is to create and add all import libs (py2.6 up to py3.4) to the toolchain and do not use any of the importlibs that might exist in the python/libs/ folder. My conclusion is: mixing different compiler architectures for building Python extensions on Windows is possible but makes it necessary to build a 'vendor' gcc toolchain. I did not find the time to put my latest binaries on the web or make numpy pull requests the github way due to my workload. Hopefully I find some time next weekend. with best regards Carl 2014-01-30 Sturla Molden : > On 27/01/14 12:01, Carl Kleffner wrote: > > Did you consider to check the experimental binaries on > > https://code.google.com/p/mingw-w64-static/ for Python-2.7? These > > binaries has been build with with a customized mingw-w64 toolchain. > > These builds are fully statically build and are link against the MSVC90 > > runtime libraries (gcc runtime is linked statically) and OpenBLAS. > > > > Carl > > Building OpenBLAS and LAPACK is very easy. I used TDM-GCC for Win64. > It's just two makefile (not even a configure script). OpenBLAS and > LAPACK are probably the easiest libraries to build there is. > > The main problem for using OpenBLAS with NumPy and SciPy on Windows is > that Python 2.7 from www.python.org does not ship with libpython27.a for > 64-bit Python, so we need to maintain our own. Also, GNU compilers are > required to build OpenBLAS. This means we have to build our own > libgfortran as well. The binary is incompatible with the MSVC runtime we > use. I.e. not impossible, but painful. > > http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063740.html > > > Sturla > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sebastian at sipsolutions.net Thu Jan 30 06:02:10 2014 From: sebastian at sipsolutions.net (sebastian) Date: Thu, 30 Jan 2014 12:02:10 +0100 Subject: [Numpy-discussion] Deprecation of boolean substract and negative (the - operator) Message-ID: <7d4e20513cb7b8cc7493cc1d411176f3@sipsolutions.net> Hey all, recently we had a small discussion about deprecating some of the operators for boolean arrays. This discussion seemed to have ended by large in the consense that while most boolean operators are well defined and should be kept, the `-` one is not very well defined on boolean arrays and has the problem of the inconsistency: - np.array(False) == True False - np.array(False) == False # leading to: False - (-np.arry(False)) != False + np.array(False) So that it is preferable to use one of the binary operators for this operation. For now this would only be a deprecation, but both operators are probably used out there. So if you have any serious doubt about starting this deprecation please note it here. The Pull request to implement such a deprecation is: https://github.com/numpy/numpy/pull/4105 Regards, Sebastian From sturla.molden at gmail.com Thu Jan 30 06:38:21 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Thu, 30 Jan 2014 12:38:21 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: On 30/01/14 12:01, Carl Kleffner wrote: > My conclusion is: mixing different compiler architectures for building > Python extensions on Windows is possible but makes it necessary to build > a 'vendor' gcc toolchain. Right. This makes a nice twist on the infamous XML and Regex story: - There once was a man who had a problem building NumPy. Then he thought, "I'll just use a custom compiler toolchain." Now he had two problems. Setting up a custom GNU toolchain for NumPy on Windows would not be robust enough. And when there be bugs, we have two places to look for them instead of one. By using a tested and verified compiler toolchain, there is one place less things can go wrong. I would rather consider distributing NumPy binaries linked with MKL, if Intel's license allows it. Sturla From cmkleffner at gmail.com Thu Jan 30 07:29:45 2014 From: cmkleffner at gmail.com (Carl Kleffner) Date: Thu, 30 Jan 2014 13:29:45 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: I fully agree with you. But you have to consider the following: - the officially mingw-w64 toolchains are build almost the same way. The only difference is, that they have non-static builds (that would be preferable for C++ development BTW) - you won't get the necessary addons like spec-files, manifest resource files for msvcr90,100 from there. - there is a urgent need for a free and portable C,C++, Fortran compiler for Windows with full blas, lapack support. You won't get that with numpy-MKL, but with a GNU toolchain and OpenBLAS. Not everyone can buy the Intel Fortran compiler or is allowed to install it. - you can build 3rd party extensions which use blas,lapack directly or with cython with such a toolchain regardless if you use numpy/scipy-MKL or mingw-based numpy/scipy - The licence question of numpy-MKL is unclear. I know that MKL is linked in statically. But can I redistribite it myself or use it in commercial context without buying a Intel licence? 
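Whichever BLAS the binaries end up shipping with, it is useful to be able to check what an installed NumPy build is actually linked against; a quick check using only public NumPy functions:

import numpy as np
import time

# Show the BLAS/LAPACK configuration NumPy was built with
# (ATLAS, OpenBLAS, MKL, Accelerate, or the unoptimized fallback).
np.show_config()

# Rough sanity check: an optimized BLAS does a 2000x2000 matrix
# product in well under a second on current hardware.
a = np.random.rand(2000, 2000)
t0 = time.time()
np.dot(a, a)
print("dgemm time: %.2f s" % (time.time() - t0))
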
Carl 2014-01-30 Sturla Molden : > On 30/01/14 12:01, Carl Kleffner wrote: > > > My conclusion is: mixing different compiler architectures for building > > Python extensions on Windows is possible but makes it necessary to build > > a 'vendor' gcc toolchain. > > Right. > > This makes a nice twist on the infamous XML and Regex story: > > - There once was a man who had a problem building NumPy. Then he > thought, "I'll just use a custom compiler toolchain." Now he had two > problems. > > Setting up a custom GNU toolchain for NumPy on Windows would not be > robust enough. And when there be bugs, we have two places to look for > them instead of one. > > By using a tested and verified compiler toolchain, there is one place > less things can go wrong. I would rather consider distributing NumPy > binaries linked with MKL, if Intel's license allows it. > > Sturla > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Thu Jan 30 13:43:40 2014 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 30 Jan 2014 10:43:40 -0800 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: Hi, On Thu, Jan 30, 2014 at 4:29 AM, Carl Kleffner wrote: > I fully agree with you. But you have to consider the following: > > - the officially mingw-w64 toolchains are build almost the same way. The > only difference is, that they have non-static builds (that would be > preferable for C++ development BTW) > - you won't get the necessary addons like spec-files, manifest resource > files for msvcr90,100 from there. > - there is a urgent need for a free and portable C,C++, Fortran compiler for > Windows with full blas, lapack support. You won't get that with numpy-MKL, > but with a GNU toolchain and OpenBLAS. Not everyone can buy the Intel > Fortran compiler or is allowed to install it. Thanks for doing this - I'd love to see the toolchain. If there's anything I can do to help, please let me know. The only obvious thing I can think of is using our buildbots or just the spare machines we have: http://nipy.bic.berkeley.edu/ but if you can think of anything else, please let me know. Cheers, Matthew From jenny.stone125 at gmail.com Thu Jan 30 18:01:01 2014 From: jenny.stone125 at gmail.com (jennifer stone) Date: Fri, 31 Jan 2014 04:31:01 +0530 Subject: [Numpy-discussion] Suggestions for GSoC Projects Message-ID: With GSoC 2014 being round the corner, I hereby put up few projects for discussion that I would love to pursue as a student. Guidance, suggestions are cordially welcome:- 1. If I am not mistaken, contour integration is not supported by SciPy; in fact even line integrals of real functions is yet to be implemented in SciPy, which is surprising. Though we at present have SymPy for line Integrals, I doubt if there is any open-source python package supporting the calculation of Contour Integrals. With integrate module of SciPy already having been properly developed for definite integration, implementation of line as well as contour integrals, I presume; would not require work from scratch and shall be a challenging but fruitful project. 2. I really have no idea if the purpose of NumPy or SciPy would encompass this but we are yet to have indefinite integration. 
An implementation of that, though highly challenging, may open doors for innumerable other functions like the ones to calculate the Laplace transform, Hankel transform and many more. 3. As stated earlier, we have spherical harmonic functions (with much scope for dev) we are yet to have elliptical and cylindrical harmonic function, which may be developed. 4. Lastly, we are yet to have Inverse Laplace transforms which as Ralf has rightly pointed out it may be too challenging to implement. 5. Further reading the road-map given by Mr.Ralf, I would like to develop the Bluestein's FFT algorithm. Thanks for reading along till the end. I shall append to this mail as when I am struck with ideas. Please do give your valuable guidance -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Fri Jan 31 00:44:01 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Fri, 31 Jan 2014 06:44:01 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: By the way, it seems OpenBLAS builds with clang on MacOSX, so presumably it works on Windows as well. Unlike GNU toolchains, there is a cl-clang frontend which is supposed to be MSVC compatible. BTW, clang is a fantastic compiler, but little known among Windows users where MSVC and MinGW dominate. Sturla On 30/01/14 13:29, Carl Kleffner wrote: > I fully agree with you. But you have to consider the following: > > - the officially mingw-w64 toolchains are build almost the same way. The > only difference is, that they have non-static builds (that would be > preferable for C++ development BTW) > - you won't get the necessary addons like spec-files, manifest resource > files for msvcr90,100 from there. > - there is a urgent need for a free and portable C,C++, Fortran compiler > for Windows with full blas, lapack support. You won't get that with > numpy-MKL, but with a GNU toolchain and OpenBLAS. Not everyone can buy > the Intel Fortran compiler or is allowed to install it. > - you can build 3rd party extensions which use blas,lapack directly or > with cython with such a toolchain regardless if you use numpy/scipy-MKL > or mingw-based numpy/scipy > - The licence question of numpy-MKL is unclear. I know that MKL is > linked in statically. But can I redistribite it myself or use it in > commercial context without buying a Intel licence? > > Carl > > > 2014-01-30 Sturla Molden >: > > On 30/01/14 12:01, Carl Kleffner wrote: > > > My conclusion is: mixing different compiler architectures for > building > > Python extensions on Windows is possible but makes it necessary > to build > > a 'vendor' gcc toolchain. > > Right. > > This makes a nice twist on the infamous XML and Regex story: > > - There once was a man who had a problem building NumPy. Then he > thought, "I'll just use a custom compiler toolchain." Now he had two > problems. > > Setting up a custom GNU toolchain for NumPy on Windows would not be > robust enough. And when there be bugs, we have two places to look for > them instead of one. > > By using a tested and verified compiler toolchain, there is one place > less things can go wrong. I would rather consider distributing NumPy > binaries linked with MKL, if Intel's license allows it. 
> > Sturla > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > From chris.laumann at gmail.com Fri Jan 31 01:20:52 2014 From: chris.laumann at gmail.com (Chris Laumann) Date: Thu, 30 Jan 2014 23:20:52 -0700 Subject: [Numpy-discussion] Memory leak? In-Reply-To: References: Message-ID: Hi all- The following snippet appears to leak memory badly (about 10 MB per execution): P = randint(0,2,(30,13)) for i in range(50): print "\r", i, "/", 50 for ai in ndindex((2,)*13): j = np.sum(P.dot(ai)) If instead you execute (no np.sum call): P = randint(0,2,(30,13)) for i in range(50): print "\r", i, "/", 50 for ai in ndindex((2,)*13): j = P.dot(ai) There is no leak. Any thoughts? I'm stumped. Best, Chris -- Chris Laumann Sent with Airmail -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Fri Jan 31 02:20:22 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Fri, 31 Jan 2014 08:20:22 +0100 Subject: [Numpy-discussion] MKL and OpenBLAS In-Reply-To: References: Message-ID: On 26/01/14 13:44, Dinesh Vadhia wrote:> This conversation gets discussed often with Numpy developers but since > the requirement for optimized Blas is pretty common these days, how > about distributing Numpy with OpenBlas by default? People who don't > want optimized BLAS or OpenBLAS can then edit the site.cfg file to > add/remove. I can never remember if Numpy comes with Atlas by default > but either way, if using MKL is not feasible because of its licensing > issues then Numpy has to be re-compiled with OpenBLAS (for example). > Why not make it easier for developers to use Numpy with an in-built > optimized Blas. > Btw, just in case some folks from Intel are listening: how about > releasing MKL binaries for all platforms for developers to do with it > what they want ie. free. You know it makes sense! There is an active discussion on this here: https://github.com/xianyi/OpenBLAS/issues/294 Sturla From jtaylor.debian at googlemail.com Fri Jan 31 04:29:34 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 31 Jan 2014 10:29:34 +0100 Subject: [Numpy-discussion] Memory leak? In-Reply-To: References: Message-ID: which version of numpy are you using? there seems to be a leak in the scalar return due to the PyObject_Malloc usage in git master, but it doesn't affect 1.8.0 On Fri, Jan 31, 2014 at 7:20 AM, Chris Laumann wrote: > Hi all- > > The following snippet appears to leak memory badly (about 10 MB per > execution): > > P = randint(0,2,(30,13)) > > for i in range(50): > print "\r", i, "/", 50 > for ai in ndindex((2,)*13): > j = np.sum(P.dot(ai)) > > If instead you execute (no np.sum call): > > P = randint(0,2,(30,13)) > > for i in range(50): > print "\r", i, "/", 50 > for ai in ndindex((2,)*13): > j = P.dot(ai) > > There is no leak. > > Any thoughts? I'm stumped. > > Best, Chris > > -- > Chris Laumann > Sent with Airmail > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed...
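For anyone trying to confirm whether their own numpy build shows this leak, one rough way to see it without guessing is to watch the process's peak memory around the loop. The harness below is an illustrative sketch, not something from the thread: it assumes a Unix-like platform (the resource module) and that ru_maxrss is reported in kilobytes as on Linux (OS X reports bytes, so drop the division there):

import resource
import numpy as np

def peak_rss_mb():
    # Peak resident set size so far, in MB (ru_maxrss is KB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

P = np.random.randint(0, 2, (30, 13))
start = peak_rss_mb()
for i in range(5):
    for ai in np.ndindex((2,) * 13):
        j = np.sum(P.dot(ai))          # the variant reported to leak
    print("pass %d: peak RSS up %.1f MB" % (i, peak_rss_mb() - start))

On an affected build the reported growth should climb by roughly the 10 MB per pass described above; on numpy 1.8.0 it should stay essentially flat.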
URL: From stefan at sun.ac.za Fri Jan 31 05:34:59 2014 From: stefan at sun.ac.za (=?iso-8859-1?Q?St=E9fan?= van der Walt) Date: Fri, 31 Jan 2014 11:34:59 +0100 Subject: [Numpy-discussion] Suggestions for GSoC Projects In-Reply-To: References: Message-ID: <20140131103459.GA24791@gmail.com> On Fri, 31 Jan 2014 04:31:01 +0530, jennifer stone wrote: > 3. As stated earlier, we have spherical harmonic functions (with much scope > for dev) we are yet to have elliptical and cylindrical harmonic function, > which may be developed. As stated before, I am personally interested in seeing the spherical harmonics in SciPy improve. > 5. Further reading the road-map given by Mr.Ralf, I would like to develop > the Bluestein's FFT algorithm. https://gist.github.com/endolith/2783807 Regards St?fan From vanforeest at gmail.com Fri Jan 31 05:36:41 2014 From: vanforeest at gmail.com (nicky van foreest) Date: Fri, 31 Jan 2014 11:36:41 +0100 Subject: [Numpy-discussion] Suggestions for GSoC Projects In-Reply-To: References: Message-ID: Hi Jennifer, On 31 January 2014 00:01, jennifer stone wrote: > With GSoC 2014 being round the corner, I hereby put up few projects for > discussion that I would love to pursue as a student. > Guidance, suggestions are cordially welcome:- > > 1. If I am not mistaken, contour integration is not supported by SciPy; in > fact even line integrals of real functions is yet to be implemented in > SciPy, which is surprising. Though we at present have SymPy for line > Integrals, I doubt if there is any open-source python package supporting > the calculation of Contour Integrals. With integrate module of SciPy > already having been properly developed for definite integration, > implementation of line as well as contour integrals, I presume; would not > require work from scratch and shall be a challenging but fruitful project. > > 2. I really have no idea if the purpose of NumPy or SciPy would encompass > this but we are yet to have indefinite integration. An implementation of > that, though highly challenging, may open doors for innumerable other > functions like the ones to calculate the Laplace transform, Hankel > transform and many more. > > 3. As stated earlier, we have spherical harmonic functions (with much > scope for dev) we are yet to have elliptical and cylindrical harmonic > function, which may be developed. > > 4. Lastly, we are yet to have Inverse Laplace transforms which as Ralf has > rightly pointed out it may be too challenging to implement. > I once ported a method of Abate and Whitt to python. My aim was not to produce the nicest python implementation, but to stick closely to the code of Abate and Whitt in their paper. However, it might a useful starting point. http://nicky.vanforeest.com/queueing/euler/euler.html?highlight=laplace Nicky > > 5. Further reading the road-map given by Mr.Ralf, I would like to develop > the Bluestein's FFT algorithm. > > Thanks for reading along till the end. I shall append to this mail as when > I am struck with ideas. Please do give your valuable guidance > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.laumann at gmail.com Fri Jan 31 10:14:14 2014 From: chris.laumann at gmail.com (Chris Laumann) Date: Fri, 31 Jan 2014 08:14:14 -0700 Subject: [Numpy-discussion] Memory leak? 
In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jan 31 10:37:46 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 31 Jan 2014 15:37:46 +0000 Subject: [Numpy-discussion] Memory leak? In-Reply-To: References: Message-ID: On Fri, Jan 31, 2014 at 3:14 PM, Chris Laumann wrote: > > Current scipy superpack for osx so probably pretty close to master. What does numpy.__version__ say? -n From ben.root at ou.edu Fri Jan 31 11:29:10 2014 From: ben.root at ou.edu (Benjamin Root) Date: Fri, 31 Jan 2014 11:29:10 -0500 Subject: [Numpy-discussion] Memory leak? In-Reply-To: References: Message-ID: Just to chime in here about the SciPy Superpack... this distribution tracks the master branch of many projects, and then puts out releases, on the assumption that master contains pristine code, I guess. I have gone down strange rabbit holes thinking that a particular bug was fixed already and the user telling me a version number that would confirm that, only to discover that the superpack actually packaged matplotlib about a month prior to releasing a version. I will not comment on how good or bad of an idea it is for the Superpack to do that, but I just wanted to make other developers aware of this to keep them from falling down the same rabbit hole. Cheers! Ben Root On Fri, Jan 31, 2014 at 10:37 AM, Nathaniel Smith wrote: > On Fri, Jan 31, 2014 at 3:14 PM, Chris Laumann > wrote: > > > > Current scipy superpack for osx so probably pretty close to master. > > What does numpy.__version__ say? > > -n > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jan 31 12:12:21 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 31 Jan 2014 17:12:21 +0000 Subject: [Numpy-discussion] Memory leak? In-Reply-To: References: Message-ID: On Fri, Jan 31, 2014 at 4:29 PM, Benjamin Root wrote: > Just to chime in here about the SciPy Superpack... this distribution tracks > the master branch of many projects, and then puts out releases, on the > assumption that master contains pristine code, I guess. I have gone down > strange rabbit holes thinking that a particular bug was fixed already and > the user telling me a version number that would confirm that, only to > discover that the superpack actually packaged matplotlib about a month prior > to releasing a version. > > I will not comment on how good or bad of an idea it is for the Superpack to > do that, but I just wanted to make other developers aware of this to keep > them from falling down the same rabbit hole. Wow, that is good to know. Esp. since the web page: http://fonnesbeck.github.io/ScipySuperpack/ simply advertises that it gives you things like numpy 1.9 and scipy 0.14, which don't exist. (With some note about dev versions buried in prose a few sentences later.) Empirically, development versions of numpy have always contained bugs, regressions, and compatibility breaks that were fixed in the released version; and we make absolutely no guarantees about compatibility between dev versions and any release versions. And it sort of has to be that way for us to be able to make progress. But if too many people start using dev versions for daily use, then we and downstream dependencies will have to start adding compatibility hacks and stuff to support those dev versions. 
Which would be a nightmare for developers and users both. Recommending this build for daily use by non-developers strikes me as dangerous for both users and the wider ecosystem. -n From jtaylor.debian at googlemail.com Fri Jan 31 12:31:28 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Fri, 31 Jan 2014 18:31:28 +0100 Subject: [Numpy-discussion] Memory leak? In-Reply-To: References: Message-ID: <52EBDDF0.2040006@googlemail.com> On 31.01.2014 18:12, Nathaniel Smith wrote: > On Fri, Jan 31, 2014 at 4:29 PM, Benjamin Root wrote: >> Just to chime in here about the SciPy Superpack... this distribution tracks >> the master branch of many projects, and then puts out releases, on the >> assumption that master contains pristine code, I guess. I have gone down >> strange rabbit holes thinking that a particular bug was fixed already and >> the user telling me a version number that would confirm that, only to >> discover that the superpack actually packaged matplotlib about a month prior >> to releasing a version. >> >> I will not comment on how good or bad of an idea it is for the Superpack to >> do that, but I just wanted to make other developers aware of this to keep >> them from falling down the same rabbit hole. > > Wow, that is good to know. Esp. since the web page: > http://fonnesbeck.github.io/ScipySuperpack/ > simply advertises that it gives you things like numpy 1.9 and scipy > 0.14, which don't exist. (With some note about dev versions buried in > prose a few sentences later.) > > Empirically, development versions of numpy have always contained bugs, > regressions, and compatibility breaks that were fixed in the released > version; and we make absolutely no guarantees about compatibility > between dev versions and any release versions. And it sort of has to > be that way for us to be able to make progress. But if too many people > start using dev versions for daily use, then we and downstream > dependencies will have to start adding compatibility hacks and stuff > to support those dev versions. Which would be a nightmare for > developers and users both. > > Recommending this build for daily use by non-developers strikes me as > dangerous for both users and the wider ecosystem. > while probably not good for the user I think its very good for us. This is the second bug I introduced found by superpack users. This one might have gone unnoticed into the next release as it is pretty much impossible to find via tests. Even in valgrind reports its hard to find as its lumped in with all of pythons hundreds of memory arena still-reachable leaks. Concerning the fix, it seems if python sees tp_free == PYObject_Del/Free it replaces it with the tp_free of the base type which is int_free in this case. int_free uses a special allocator for even lower overhead so we start leaking. We either need to find the right flag to set for our scalars so it stops doing that, add an indirection so the function pointers don't match or stop using the object allocator as we are apparently digging to deep into pythons internal implementation details by doing so. From charlesr.harris at gmail.com Fri Jan 31 12:40:32 2014 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 31 Jan 2014 10:40:32 -0700 Subject: [Numpy-discussion] Suggestions for GSoC Projects In-Reply-To: References: Message-ID: On Thu, Jan 30, 2014 at 4:01 PM, jennifer stone wrote: > With GSoC 2014 being round the corner, I hereby put up few projects for > discussion that I would love to pursue as a student. 
> Guidance, suggestions are cordially welcome:- > > 1. If I am not mistaken, contour integration is not supported by SciPy; in > fact even line integrals of real functions is yet to be implemented in > SciPy, which is surprising. Though we at present have SymPy for line > Integrals, I doubt if there is any open-source python package supporting > the calculation of Contour Integrals. With integrate module of SciPy > already having been properly developed for definite integration, > implementation of line as well as contour integrals, I presume; would not > require work from scratch and shall be a challenging but fruitful project. > > No comment, as I don't use this functionality. I don't know how many folks would want this. > 2. I really have no idea if the purpose of NumPy or SciPy would encompass > this but we are yet to have indefinite integration. An implementation of > that, though highly challenging, may open doors for innumerable other > functions like the ones to calculate the Laplace transform, Hankel > transform and many more. > > 3. As stated earlier, we have spherical harmonic functions (with much > scope for dev) we are yet to have elliptical and cylindrical harmonic > function, which may be developed. > This sounds very doable. How much work do you think would be involved? > > 4. Lastly, we are yet to have Inverse Laplace transforms which as Ralf has > rightly pointed out it may be too challenging to implement. > > This is more ambitious, I'm not in a position to comment on whether it is doable in the summer time frame. > 5. Further reading the road-map given by Mr.Ralf, I would like to develop > the Bluestein's FFT algorithm. > > This one could be quite involved, but useful. The problem is not so much *a* Bluestein FFT, but combining it with the current FFTPACK so that factors other than 2,3,4, or 5 are handled with the Bluestein algorithm. FFTPACK is in Fortran and not very well documented. I wouldn't recommend this project unless you are pretty familiar with FFTs and Fortran. It is unfortunate that the latest versions of FFTPACK are GPL. A BSD licensed package that already implements the Bluestein algorithm for FFTs is Parallel Colt, which is in Java but could maybe be translated. A similar but smaller project, not involving integration with the general FFT, would be a stand alone chirpz transform, might be too easy though ;) > Thanks for reading along till the end. I shall append to this mail as when > I am struck with ideas. Please do give your valuable guidance > > Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL:
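To make the stand-alone chirp-z suggestion concrete, here is a minimal sketch of Bluestein's identity for a length-N DFT, using only power-of-two FFTs internally. It is a toy illustration of the algorithm, not FFTPACK integration; the function name czt_dft is invented for the example, and a production version would at least need the modular form of the chirp exponent to keep precision for large N, plus the general chirp-z parameters (arbitrary starting point and ratio on the z-plane):

import numpy as np

def czt_dft(x):
    """DFT of arbitrary length N via Bluestein's chirp-z trick.

    Only power-of-two FFTs of length M >= 2N-1 are used internally,
    so the cost stays O(N log N) even when N is prime.
    """
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    w = np.exp(-1j * np.pi * n**2 / float(N))      # chirp exp(-i*pi*n^2/N)
    M = 1 << int(np.ceil(np.log2(2 * N - 1)))      # next power of two >= 2N-1
    a = np.zeros(M, dtype=complex)
    a[:N] = x * w
    b = np.zeros(M, dtype=complex)
    b[:N] = np.conj(w)
    if N > 1:
        b[-(N - 1):] = np.conj(w[1:])[::-1]        # negative lags wrap around
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
    return w * conv[:N]

# Sanity check against numpy's own FFT for a prime (awkward) length:
x = np.random.rand(17) + 1j * np.random.rand(17)
assert np.allclose(czt_dft(x), np.fft.fft(x))

The real effort in the project described above is not this kernel but wiring it into the existing FFTPACK machinery so that lengths with large prime factors are routed through it automatically.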