From jtaylor.debian at googlemail.com Mon Apr 3 07:28:22 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Mon, 3 Apr 2017 13:28:22 +0200 Subject: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated) In-Reply-To: <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> References: <1489688042-5554705.54580375.fv2GIDbTc031721@rs159.luxsci.com> <78cad834-ff24-3a21-ed14-912309d8089d@googlemail.com> <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> Message-ID: <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> On 31.03.2017 16:07, Julian Taylor wrote: > On 31.03.2017 15:51, Nathaniel Smith wrote: >> On Mar 31, 2017 1:15 AM, "Ralf Gommers" > > wrote: >> >> >> >> On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers >> > wrote: >> >> >> >> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor >> > > wrote: >> >> I have two ideas under one big important topic: make numpy >> python3 >> compatible. >> >> The first fits pretty well with the grant size and nobody >> wants to do it >> for free: >> - fix our text IO functions under python3 and support multiple >> encodings, not only latin1. >> Reasonably simple to do, slap encoding arguments on the >> functions, >> generate test cases and somehow keep backward compatibility. >> Some >> prelimary unfinished work is in >> https://github.com/numpy/numpy/pull/4208 >> >> >> >> I like that idea, it's a recurring pain point. Are you >> interested to work on it, or are you thinking to advertise the >> idea here to see if anyone steps up? >> >> >> More thoughts on this anyone? Or preferences for this idea or the >> numpy.org one? Submission deadline is April 3rd >> and we can only put in one proposal this time, so we need to (a) >> make a choice between these ideas, and (b) write up a proposal. >> >> If there's not enough replies to this so the choice is clear cut, I >> will send out a poll to the core devs. 
>> >> >> Do we have anyone interested in doing the work in either case? That >> seems like the most important consideration to me... >> >> -n >> > > I could do the textio thing if no one shows up for numpy.org. I can > probably check again what is required in the next few days and write a > proposal. > The change will need reviewing in the end too, should that be > compensated too? It feels weird if not. > I have decided not to do it, as it is more or less just a bugfix and I currently do not feel capable of doing it under the added completion pressure. But I have collected some related issues and discussions: https://github.com/numpy/numpy/issues/4600 https://github.com/numpy/numpy/issues/3184 http://numpy-discussion.10968.n7.nabble.com/using-loadtxt-to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 # loadtxt https://github.com/numpy/numpy/pull/4208 # genfromtxt http://numpy-discussion.10968.n7.nabble.com/genfromtxt-universal-newline-support-td37816.html https://github.com/dhomeier/numpy/commit/995ec93 From renato.fabbri at gmail.com Mon Apr 3 08:14:56 2017 From: renato.fabbri at gmail.com (Renato Fabbri) Date: Mon, 3 Apr 2017 09:14:56 -0300 Subject: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated) In-Reply-To: <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> References: <1489688042-5554705.54580375.fv2GIDbTc031721@rs159.luxsci.com> <78cad834-ff24-3a21-ed14-912309d8089d@googlemail.com> <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> Message-ID: maybe OT, but for some years now I have had the recurring idea of making a very simple module for obtaining arrays related to musical elements. All here: https://github.com/ttm/dissertacao scripts/ has Python/NumPy implementations of the musical elements. dissertacaoCorrigida.pdf holds a thorough description of the framework.
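For a taste of the kind of primitive such a module could provide, a minimal sketch (the `note` helper and its defaults are hypothetical, for illustration only, not code from the dissertation):

```python
import numpy as np

def note(freq=440.0, duration=1.0, fs=44100):
    """Hypothetical sketch: one sine-wave note as a NumPy array of samples."""
    t = np.arange(int(fs * duration)) / fs  # sample instants in seconds
    return np.sin(2 * np.pi * freq * t)

a4 = note()  # concert A: one second of a 440 Hz sinusoid at 44.1 kHz
```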
I idealize it as a module inside Numpy but I understand it might be reasonable to do it as a Scipy kit. I handed my doctorate a few days ago and might be willing to put some time into this. PS. long time no post. Hello! On Mon, Apr 3, 2017 at 8:28 AM, Julian Taylor wrote: > On 31.03.2017 16:07, Julian Taylor wrote: > > On 31.03.2017 15:51, Nathaniel Smith wrote: > >> On Mar 31, 2017 1:15 AM, "Ralf Gommers" >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers > >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor > >> >> > wrote: > >> > >> I have two ideas under one big important topic: make numpy > >> python3 > >> compatible. > >> > >> The first fits pretty well with the grant size and nobody > >> wants to do it > >> for free: > >> - fix our text IO functions under python3 and support > multiple > >> encodings, not only latin1. > >> Reasonably simple to do, slap encoding arguments on the > >> functions, > >> generate test cases and somehow keep backward compatibility. > >> Some > >> prelimary unfinished work is in > >> https://github.com/numpy/numpy/pull/4208 > >> > >> > >> > >> I like that idea, it's a recurring pain point. Are you > >> interested to work on it, or are you thinking to advertise the > >> idea here to see if anyone steps up? > >> > >> > >> More thoughts on this anyone? Or preferences for this idea or the > >> numpy.org one? Submission deadline is April 3rd > >> and we can only put in one proposal this time, so we need to (a) > >> make a choice between these ideas, and (b) write up a proposal. > >> > >> If there's not enough replies to this so the choice is clear cut, I > >> will send out a poll to the core devs. > >> > >> > >> Do we have anyone interested in doing the work in either case? That > >> seems like the most important consideration to me... > >> > >> -n > >> > > > > I could do the textio thing if no one shows up for numpy.org. 
I can > > probably check again what is required in the next few days and write a > > proposal. > > The change will need reviewing in the end too, should that be > > compensated too? It feels weird if not. > > > > I have decided to not do it, as it is more or less just a bugfix and I > currently do not feel capable of doing with added completion pressure. > But I have collected some of related issues and discussions: > > https://github.com/numpy/numpy/issues/4600 > https://github.com/numpy/numpy/issues/3184 > http://numpy-discussion.10968.n7.nabble.com/using-loadtxt- > to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 > # loadtxt > https://github.com/numpy/numpy/pull/4208 > # genfromtxt > http://numpy-discussion.10968.n7.nabble.com/genfromtxt- > universal-newline-support-td37816.html > https://github.com/dhomeier/numpy/commit/995ec93 > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- Renato Fabbri GNU/Linux User #479299 labmacambira.sourceforge.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 09:20:56 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 15:20:56 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Hello, Le 30/03/2017 ? 13:31, Pierre Haessig a ?crit : > [....] > > But how come Julia is 4-5x faster since Numpy uses C implementation > for the entire process ? (Mersenne Twister -> uniform double -> > Box-Muller transform to get a Gaussian > https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/randomkit.c). 
> Also I noticed that Julia uses a different algorithm (Ziggurat Method > from Marsaglia and Tsang , > https://github.com/JuliaLang/julia/blob/master/base/random.jl#L700) > but this doesn't explain the difference for uniform rng. > Any ideas? Do you think Stackoverflow would be a better place for my question? best, Pierre -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaime.frio at gmail.com Mon Apr 3 09:44:36 2017 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Mon, 3 Apr 2017 15:44:36 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: On Mon, Apr 3, 2017 at 3:20 PM, Pierre Haessig wrote: > Hello, > Le 30/03/2017 ? 13:31, Pierre Haessig a ?crit : > > [....] > > But how come Julia is 4-5x faster since Numpy uses C implementation for > the entire process ? (Mersenne Twister -> uniform double -> Box-Muller > transform to get a Gaussian https://github.com/numpy/ > numpy/blob/master/numpy/random/mtrand/randomkit.c). Also I noticed that > Julia uses a different algorithm (Ziggurat Method from Marsaglia and Tsang > , https://github.com/JuliaLang/julia/blob/master/base/random.jl#L700) but > this doesn't explain the difference for uniform rng. > > Any ideas? > This says that Julia uses this library , which is different from the home brewed version of the Mersenne twister in NumPy. The second link I posted claims their speed comes from generating double precision numbers directly, rather than generating random bytes that have to be converted to doubles, as is the case of NumPy through this magical incantation . They also throw the SIMD acronym around, which likely means their random number generation is parallelized. 
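As a rough way to frame the comparison, the uniform stage and the full Gaussian pipeline can be timed separately on the NumPy side; a minimal sketch (absolute numbers are machine-dependent and only illustrate the methodology):

```python
import timeit
import numpy as np

n = 1_000_000
# Mersenne Twister -> uniform doubles
t_uniform = timeit.timeit(lambda: np.random.random_sample(n), number=10)
# Mersenne Twister -> uniform doubles -> Gaussian transform
t_normal = timeit.timeit(lambda: np.random.standard_normal(n), number=10)
print("uniform: %.3fs  normal: %.3fs" % (t_uniform, t_normal))
```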
My guess is that most of the speed-up comes from the SIMD parallelization: the Mersenne algorithm does a lot of work to produce 32 random bits, so that likely dominates over a couple of arithmetic operations, even if divisions are involved. Jaime Do you think Stackoverflow would be a better place for my question? > > best, > > Pierre > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Mon Apr 3 09:52:06 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Mon, 03 Apr 2017 13:52:06 +0000 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Take a look here: https://bashtage.github.io/ng-numpy-randomstate/doc/index.html On Mon, Apr 3, 2017 at 9:45 AM Jaime Fern?ndez del R?o wrote: > On Mon, Apr 3, 2017 at 3:20 PM, Pierre Haessig > wrote: > > Hello, > Le 30/03/2017 ? 13:31, Pierre Haessig a ?crit : > > [....] > > But how come Julia is 4-5x faster since Numpy uses C implementation for > the entire process ? (Mersenne Twister -> uniform double -> Box-Muller > transform to get a Gaussian > https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/randomkit.c). > Also I noticed that Julia uses a different algorithm (Ziggurat Method > from Marsaglia and Tsang , > https://github.com/JuliaLang/julia/blob/master/base/random.jl#L700) but > this doesn't explain the difference for uniform rng. > > Any ideas? > > > This > says > that Julia uses this library > , which is > different from the home brewed version of the Mersenne twister in NumPy. 
> The second link I posted claims their speed comes from generating double > precision numbers directly, rather than generating random bytes that have > to be converted to doubles, as is the case of NumPy through this magical > incantation > . > They also throw the SIMD acronym around, which likely means their random > number generation is parallelized. > > My guess is that most of the speed-up comes from the SIMD parallelization: > the Mersenne algorithm does a lot of work > to > produce 32 random bits, so that likely dominates over a couple of > arithmetic operations, even if divisions are involved. > > Jaime > > Do you think Stackoverflow would be a better place for my question? > > best, > > Pierre > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes > de dominaci?n mundial. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 11:46:58 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 17:46:58 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Le 03/04/2017 ? 15:52, Neal Becker a ?crit : > Take a look here: > https://bashtage.github.io/ng-numpy-randomstate/doc/index.html Thanks for the pointer. A very feature-full random generator package. So it is indeed possible to have in Python/Numpy both the "advanced" Mersenne Twister (dSFMT) at the lower level and the Ziggurat algorithm for Gaussian transform on top. Perfect! 
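For reference, this line of work is what eventually landed in NumPy itself as `numpy.random.Generator` (NumPy >= 1.17), whose `standard_normal` uses a Ziggurat implementation; a minimal usage sketch:

```python
import numpy as np

rng = np.random.default_rng(12345)  # PCG64 bit generator by default
u = rng.random(5)                   # uniform doubles in [0, 1)
z = rng.standard_normal(5)          # Gaussian samples via the Ziggurat method
```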
In an ideal world, this would be implemented by default in Numpy, but I understand that this would break the reproducibility of existing codes. best, Pierre From ndbecker2 at gmail.com Mon Apr 3 11:49:19 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Mon, 03 Apr 2017 15:49:19 +0000 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: I think the intention is that this is the next gen of numpy randomstate, and will eventually be merged in. On Mon, Apr 3, 2017 at 11:47 AM Pierre Haessig wrote: > > Le 03/04/2017 ? 15:52, Neal Becker a ?crit : > > Take a look here: > > https://bashtage.github.io/ng-numpy-randomstate/doc/index.html > Thanks for the pointer. A very feature-full random generator package. > > So it is indeed possible to have in Python/Numpy both the "advanced" > Mersenne Twister (dSFMT) at the lower level and the Ziggurat algorithm > for Gaussian transform on top. Perfect! > > In an ideal world, this would be implemented by default in Numpy, but I > understand that this would break the reproducibility of existing codes. > > best, > Pierre > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 11:59:22 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 17:59:22 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Le 03/04/2017 ? 15:44, Jaime Fern?ndez del R?o a ?crit : > This > says > that Julia uses this library > , which > is different from the home brewed version of the Mersenne twister in > NumPy. 
The second link I posted claims their speed comes from > generating double precision numbers directly, rather than generating > random bytes that have to be converted to doubles, as is the case of > NumPy through this magical incantation > . > They also throw the SIMD acronym around, which likely means their > random number generation is parallelized. > > My guess is that most of the speed-up comes from the SIMD > parallelization: the Mersenne algorithm does a lot of work > to > produce 32 random bits, so that likely dominates over a couple of > arithmetic operations, even if divisions are involved. Thanks for the feedback. I'm not good in enough in reading Julia to be 100% sure, but I feel like that the random.jl (https://github.com/JuliaLang/julia/blob/master/base/random.jl) contains a Julia implementation of Mersenne Twister... but I have no idea whether it is the "fancy" SIMD version or the "old" 32bits version. best, Pierre -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Apr 3 12:33:13 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 3 Apr 2017 09:33:13 -0700 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: On Apr 3, 2017 8:59 AM, "Pierre Haessig" wrote: Le 03/04/2017 ? 15:44, Jaime Fern?ndez del R?o a ?crit : This says that Julia uses this library , which is different from the home brewed version of the Mersenne twister in NumPy. The second link I posted claims their speed comes from generating double precision numbers directly, rather than generating random bytes that have to be converted to doubles, as is the case of NumPy through this magical incantation . They also throw the SIMD acronym around, which likely means their random number generation is parallelized. 
My guess is that most of the speed-up comes from the SIMD parallelization: the Mersenne algorithm does a lot of work to produce 32 random bits, so that likely dominates over a couple of arithmetic operations, even if divisions are involved. Thanks for the feedback. I'm not good in enough in reading Julia to be 100% sure, but I feel like that the random.jl (https://github.com/JuliaLang/ julia/blob/master/base/random.jl) contains a Julia implementation of Mersenne Twister... but I have no idea whether it is the "fancy" SIMD version or the "old" 32bits version. That code contains many references to "dSFMT", which is the name of the "fancy" algorithm. IIUC dSFMT is related to the mersenne twister but is actually a different generator altogether -- advertising that Julia uses the mersenne twister is somewhat misleading IMHO. Of course this is really the fault of the algorithm's designers for creating multiple algorithms that have "mersenne twister" as part of their names... -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 12:33:16 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 18:33:16 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Le 03/04/2017 ? 17:49, Neal Becker a ?crit : > I think the intention is that this is the next gen of numpy > randomstate, and will eventually be merged in. Ah yes, I found the related issue in the meantime: https://github.com/numpy/numpy/issues/6967 Thanks again for the pointers. 
Pierre From ralf.gommers at gmail.com Mon Apr 3 16:22:29 2017 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Tue, 4 Apr 2017 08:22:29 +1200 Subject: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated) In-Reply-To: <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> References: <1489688042-5554705.54580375.fv2GIDbTc031721@rs159.luxsci.com> <78cad834-ff24-3a21-ed14-912309d8089d@googlemail.com> <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> Message-ID: On Mon, Apr 3, 2017 at 11:28 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > On 31.03.2017 16:07, Julian Taylor wrote: > > On 31.03.2017 15:51, Nathaniel Smith wrote: > >> On Mar 31, 2017 1:15 AM, "Ralf Gommers" >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers > >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor > >> >> > wrote: > >> > >> I have two ideas under one big important topic: make numpy > >> python3 > >> compatible. > >> > >> The first fits pretty well with the grant size and nobody > >> wants to do it > >> for free: > >> - fix our text IO functions under python3 and support > multiple > >> encodings, not only latin1. > >> Reasonably simple to do, slap encoding arguments on the > >> functions, > >> generate test cases and somehow keep backward compatibility. > >> Some > >> prelimary unfinished work is in > >> https://github.com/numpy/numpy/pull/4208 > >> > >> > >> > >> I like that idea, it's a recurring pain point. Are you > >> interested to work on it, or are you thinking to advertise the > >> idea here to see if anyone steps up? > >> > >> > >> More thoughts on this anyone? Or preferences for this idea or the > >> numpy.org one? Submission deadline is April 3rd > >> and we can only put in one proposal this time, so we need to (a) > >> make a choice between these ideas, and (b) write up a proposal. 
> >> > >> If there's not enough replies to this so the choice is clear cut, I > >> will send out a poll to the core devs. > >> > >> > >> Do we have anyone interested in doing the work in either case? That > >> seems like the most important consideration to me... > Fair enough. Had a plan, but my weekend went a bit different than planned so couldn't follow up on it. > >> > >> -n > >> > > > > I could do the textio thing if no one shows up for numpy.org. I can > > probably check again what is required in the next few days and write a > > proposal. > > The change will need reviewing in the end too, should that be > > compensated too? It feels weird if not. > > > > I have decided to not do it, as it is more or less just a bugfix and I > currently do not feel capable of doing with added completion pressure. > Good call Julian. I struggled with the same thing - had a designer to do the numpy.org work, but that still needed someone to do the content, review, etc. Decided not to try to take that on, because I'm already struggling to keep up. > But I have collected some of related issues and discussions: > Thanks, I'm sure that'll be of use at some point. Ralf > > https://github.com/numpy/numpy/issues/4600 > https://github.com/numpy/numpy/issues/3184 > http://numpy-discussion.10968.n7.nabble.com/using-loadtxt- > to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 > # loadtxt > https://github.com/numpy/numpy/pull/4208 > # genfromtxt > http://numpy-discussion.10968.n7.nabble.com/genfromtxt- > universal-newline-support-td37816.html > https://github.com/dhomeier/numpy/commit/995ec93 > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mads.ipsen at gmail.com Tue Apr 4 03:14:07 2017 From: mads.ipsen at gmail.com (Mads Ipsen) Date: Tue, 4 Apr 2017 09:14:07 +0200 Subject: [Numpy-discussion] bitwise or'ing rows Message-ID: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> Hi If I have an n x m array of bools, is there a handy way for me to perform a 'bitwise_and' or 'bitwise_or' along an axis, for example all the rows or all the columns? For example a = [[1,0,0,0], [0,0,1,0], [0,0,0,0]] (0 and 1 meaning True and False) a.bitwise_or(axis=0) giving [1,0,1,0] Best regards, Mads -- +---------------------------------------------------------------------+ | Mads Ipsen | +----------------------------------+----------------------------------+ | Overgaden Oven Vandet 106, 4.tv | phone: +45-29716388 | | DK-1415 K?benhavn K | email: mads.ipsen at gmail.com | | Denmark | map : https://goo.gl/maps/oQ6y6 | +----------------------------------+----------------------------------+ From jaime.frio at gmail.com Tue Apr 4 03:49:37 2017 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Tue, 4 Apr 2017 09:49:37 +0200 Subject: [Numpy-discussion] bitwise or'ing rows In-Reply-To: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> References: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> Message-ID: On Tue, Apr 4, 2017 at 9:14 AM, Mads Ipsen wrote: > Hi > > If I have an n x m array of bools, is there a handy way for me to perform > a 'bitwise_and' or 'bitwise_or' along an axis, for example all the rows or > all the columns? For example > > a = > [[1,0,0,0], > [0,0,1,0], > [0,0,0,0]] (0 and 1 meaning True and False) > > a.bitwise_or(axis=0) > > giving > > [1,0,1,0] > I think what you want is equivalent to np.all(a, axis=0) for bitwise_and and np.any(a, axis=0) for bitwise_or. You can also use the more verbose np.bitwise_and.reduce(a, axis=0) and np.bitwise_or.reduce(a, axis=0). 
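In code, assuming the boolean array from the question:

```python
import numpy as np

a = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0]], dtype=bool)

or_cols = np.any(a, axis=0)                 # OR down the columns  -> [True, False, True, False]
and_cols = np.all(a, axis=0)                # AND down the columns -> [False, False, False, False]
or_cols2 = np.bitwise_or.reduce(a, axis=0)  # same result as np.any(a, axis=0)
```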
Jaime > > Best regards, > > Mads > > > > > -- > +---------------------------------------------------------------------+ > | Mads Ipsen | > +----------------------------------+----------------------------------+ > | Overgaden Oven Vandet 106, 4.tv | phone: +45-29716388 | > | DK-1415 K?benhavn K | email: mads.ipsen at gmail.com | > | Denmark | map : https://goo.gl/maps/oQ6y6 | > +----------------------------------+----------------------------------+ > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mads.ipsen at gmail.com Tue Apr 4 03:52:13 2017 From: mads.ipsen at gmail.com (Mads Ipsen) Date: Tue, 4 Apr 2017 09:52:13 +0200 Subject: [Numpy-discussion] bitwise or'ing rows In-Reply-To: References: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> Message-ID: Thanks! On 04/04/2017 09:49 AM, Jaime Fern?ndez del R?o wrote: > On Tue, Apr 4, 2017 at 9:14 AM, Mads Ipsen > wrote: > > Hi > > If I have an n x m array of bools, is there a handy way for me to > perform a 'bitwise_and' or 'bitwise_or' along an axis, for example > all the rows or all the columns? For example > > a = > [[1,0,0,0], > [0,0,1,0], > [0,0,0,0]] (0 and 1 meaning True and False) > > a.bitwise_or(axis=0) > > giving > > [1,0,1,0] > > > I think what you want is equivalent to np.all(a, axis=0) for bitwise_and > and np.any(a, axis=0) for bitwise_or. > > You can also use the more verbose np.bitwise_and.reduce(a, axis=0) and > np.bitwise_or.reduce(a, axis=0). 
> > Jaime > > > > Best regards, > > Mads > > > > > -- > +---------------------------------------------------------------------+ > | Mads Ipsen | > +----------------------------------+----------------------------------+ > | Overgaden Oven Vandet 106, 4.tv | phone: > +45-29716388 | > | DK-1415 K?benhavn K | email: > mads.ipsen at gmail.com | > | Denmark | map : https://goo.gl/maps/oQ6y6 | > +----------------------------------+----------------------------------+ > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus > planes de dominaci?n mundial. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- +---------------------------------------------------------------------+ | Mads Ipsen | +----------------------------------+----------------------------------+ | Overgaden Oven Vandet 106, 4.tv | phone: +45-29716388 | | DK-1415 K?benhavn K | email: mads.ipsen at gmail.com | | Denmark | map : https://goo.gl/maps/oQ6y6 | +----------------------------------+----------------------------------+ From charlesr.harris at gmail.com Sat Apr 8 15:45:30 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 8 Apr 2017 13:45:30 -0600 Subject: [Numpy-discussion] __array_ufunc__ Message-ID: Hi All, After a week of review and rework, the new and improved __array_ufunc__ has turned the corner and is headed down the homestretch. Now is the time for interested parties to give it a final lookover at https://github.com/numpy/numpy/pull/8247. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtaylor.debian at googlemail.com Sun Apr 9 07:27:17 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sun, 9 Apr 2017 13:27:17 +0200 Subject: [Numpy-discussion] call for testing: unicode loadtxt/genfromtxt Message-ID: <027728fd-3619-6e67-da0a-3e591020dcd8@googlemail.com> hi, It has been very very long overdue, but we finally have an attempt at making our text IO functions actually use text IO instead of bytes IO. This means genfromtxt, loadtxt, fromregex and savetxt should support unicode input files of any Python-supported encoding, as well as universal newlines. This is the first stepping stone to finally making numpy python3 compatible. The code is available in: https://github.com/numpy/numpy/pull/4208 Great effort has been spent to keep it backward compatible, but we only have our testsuite as a reference, which for sure does not cover all of the workarounds employed for this issue in the last 8 years. So we need people to dig out their ugliest hacks and test whether they still work with this changeset. Functions that need testing are: loadtxt genfromtxt fromregex savetxt Test on any input that worked in older versions of numpy (including gzip compressed) and inputs that did not work because they were encoded in something other than latin1 or had issues with linebreaks. The PR adds an encoding keyword argument to all functions dealing with text input and output. All streams opened by the functions have been changed from byte streams to text streams. As previously only latin1 encoded byte streams were supported, all input bytestreams are still decoded as such. Converters added by the user may have been relying on the input to them being bytes. To deal with that, the default encoding argument is 'bytes', which corresponds to the default encoding (None) and enables conversion to latin1 encoded bytes before passing to user converters. If you want to use converters based on strings, you now have to explicitly set encoding to something else (e.g. None).
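A minimal sketch of the intended usage (the file name and contents are made up for the demo; the `encoding` keyword is the one added by the PR, which later shipped in NumPy 1.14):

```python
import os
import tempfile
import numpy as np

# a UTF-8 input file that a latin1-only reader would mangle (note the µ)
content = u"# spannung in \u00b5V\n1.0 2.0\n3.0 4.0\n"
path = os.path.join(tempfile.mkdtemp(), "measurements.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(content)

# decode the file as UTF-8 instead of assuming latin1 bytes
arr = np.loadtxt(path, encoding="utf-8")  # -> float array of shape (2, 2)
```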
Currently the functions do not support the newlines keyword argument that Python's text IO streams support. This will probably still get added. Related issues and discussions: https://github.com/numpy/numpy/issues/4600 https://github.com/numpy/numpy/issues/3184 https://github.com/numpy/numpy/issues/4939 https://github.com/numpy/numpy/issues/4543 http://numpy-discussion.10968.n7.nabble.com/using-loadtxt-to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 http://numpy-discussion.10968.n7.nabble.com/genfromtxt-universal-newline-support-td37816.html https://github.com/dhomeier/numpy/commit/995ec93 cheers, Julian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From divenex at gmail.com Thu Apr 13 07:02:33 2017 From: divenex at gmail.com (Dive Nex) Date: Thu, 13 Apr 2017 12:02:33 +0100 Subject: [Numpy-discussion] Fixing inconsistent behaviour of reduceat() Message-ID: Hi all, I would like to try to reach a consensus about a long-standing inconsistent behavior of reduceat() reported and discussed here: https://github.com/numpy/numpy/issues/834 In summary, it seems an elegant and logical design choice, and one that all users will expect, for out = ufunc.reduceat(a, indices) to produce, for all indices j (except for the last one), the following: out[j] = ufunc.reduce(a[indices[j]:indices[j+1]]) However, the current documented and actual behavior is, for the case indices[j] >= indices[j+1], to return simply out[j] = a[indices[j]] I cannot see any application where this behavior is useful or where this choice makes sense. This seems to be just a bug that should be fixed. What do people think? PS: A quick fix for the current implementation is out = ufunc.reduceat(a, indices) out[:-1] *= np.diff(indices) > 0 -------------- next part -------------- An HTML attachment was scrubbed... 
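The inconsistency is easy to demonstrate with `np.add`:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# increasing indices behave as expected: sums over a[0:2], a[2:4], a[4:]
r1 = np.add.reduceat(a, [0, 2, 4])  # -> [3, 7, 5]

# but where indices[j] >= indices[j + 1], the output is simply a[indices[j]]:
# out[0] = a[2] = 3 (not an empty-slice reduction), out[1] = 2 + 3, out[2] = 4 + 5
r2 = np.add.reduceat(a, [2, 1, 3])  # -> [3, 5, 9]
```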
URL: 

From m.h.vankerkwijk at gmail.com Thu Apr 13 11:03:56 2017
From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk)
Date: Thu, 13 Apr 2017 11:03:56 -0400
Subject: [Numpy-discussion] Fixing inconsistent behaviour of reduceat()
In-Reply-To: 
References: 
Message-ID: 

Discussion is ongoing at the above issue, but perhaps worth mentioning more broadly the alternative of adding a slice argument (or start, stop, step arguments) to ufunc.reduce, which would mean we could just deprecate reduceat altogether, as most uses of it would become

    add.reduce(array, slice=slice(indices[:-1], indices[1:]))

(where now we are free to make the behaviour match what is expected for an empty slice). Here, one would broadcast the slice if it were 0-d, and could pass in tuples of slices if a tuple of axes was used.

-- Marten

From charlesr.harris at gmail.com Fri Apr 14 20:19:56 2017
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Fri, 14 Apr 2017 18:19:56 -0600
Subject: [Numpy-discussion] Long term plans for dropping Python 2.7
Message-ID: 

Hi All,

It may be early to discuss dropping support for Python 2.7, but there is a disturbance in the force that suggests it might be worth looking forward to the year 2020, when Python itself will drop support for 2.7. There is also a website, http://www.python3statement.org, where several projects in the scientific python stack have pledged to be Python 2.7 free by that date. Given that, a preliminary discussion of the subject might be interesting, if only to gather information on where the community currently stands.

Chuck

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From njs at pobox.com Sat Apr 15 01:19:42 2017
From: njs at pobox.com (Nathaniel Smith)
Date: Fri, 14 Apr 2017 22:19:42 -0700
Subject: [Numpy-discussion] Long term plans for dropping Python 2.7
In-Reply-To: 
References: 
Message-ID: 

On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris wrote:
> Hi All,
>
> It may be early to discuss dropping support for Python 2.7, but there is a
> disturbance in the force that suggests that it might be worth looking
> forward to the year 2020 when Python itself will drop support for 2.7. There
> is also a website, http://www.python3statement.org, where several projects
> in the scientific python stack have pledged to be Python 2.7 free by that
> date. Given that, a preliminary discussion of the subject might be
> interesting, if only to gather information of where the community currently
> stands.

One reasonable position would be that numpy releases that happen while 2.7 is supported upstream will also support 2.7, and releases after that won't.

From numpy's perspective, I feel like the most important reason to continue supporting 2.7 is our ability to convince people to keep upgrading. (Not the only reason, but the most important.) What I mean is: if we dropped 2.7 support tomorrow, then it wouldn't actually make numpy unavailable on python 2.7; it would just mean that lots of users stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the end of the world - numpy is mature software and 1.12 works pretty well. The big problem IMO would be if this then meant that lots of downstream projects felt they had to continue supporting 1.12 going forward, which would make it very difficult for us to effectively ship new features or even bug fixes - I mean, we can ship them, but no one will use them. And if a downstream project finds a bug in numpy and can't upgrade numpy, then the tendency is to work around it instead of reporting it upstream. I think this is the main thing we want to avoid.
This kind of means that we're at the mercy of downstream projects, though - if scipy/pandas/etc. decide they want to support 2.7 until 2022, it might be in our best interest to do the same. But there's a collective action problem here: we want to keep supporting 2.7 so long as they do, but at the same time they may feel they need to keep supporting 2.7 as long as we do. And all of us would prefer to drop 2.7 support sooner rather than later, but we might all get stuck because we're waiting for someone else to move first.

So my suggestion would be that numpy make some official announcement that our plan is to drop support for python 2 immediately after cpython upstream does. If worst comes to worst we can always decide to extend it at the time... but if we make the announcement now, then it's less likely that we'll need to :-).

Another interesting project to look at here is django, since they occupy a similar place in the ecosystem (e.g. last I checked numpy and django are the two most-imported python packages on github): https://www.djangoproject.com/weblog/2015/jun/25/roadmap/

Their approach isn't directly applicable, because unlike us they have a strict time-based release schedule, a defined support period for each release, and a distinction between regular and long-term support releases, where regular releases act sort of like pre-releases-on-steroids for the next LTS release. But basically what they settled on is philosophically similar to what I'm suggesting: they don't want an LTS to be supporting 2.7 beyond when cpython is supporting it. Then on top of that they don't want to support 2.7 in the regular releases leading up to that LTS either, so the net effect is that their last release with 2.7 support came out last week and will be supported until 2020 :-). And another useful precedent, I think, is that they announced this two years ago, back in 2015; if we make an announcement now, we'll be giving a similar amount of warning.

-n

-- Nathaniel J.
Smith -- https://vorpus.org From ralf.gommers at gmail.com Sat Apr 15 01:47:31 2017 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sat, 15 Apr 2017 17:47:31 +1200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote: > On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris > wrote: > > Hi All, > > > > It may be early to discuss dropping support for Python 2.7, but there is > a > > disturbance in the force that suggests that it might be worth looking > > forward to the year 2020 when Python itself will drop support for 2.7. > There > > is also a website, http://www.python3statement.org, where several > projects > > in the scientific python stack have pledged to be Python 2.7 free by that > > date. Given that, a preliminary discussion of the subject might be > > interesting, if only to gather information of where the community > currently > > stands. > > One reasonable position would that numpy releases that happen while > 2.7 is supported upstream will also support 2.7, and releases after > that won't. > > From numpy's perspective, I feel like the most important reason to > continue supporting 2.7 is our ability to convince people to keep > upgrading. (Not the only reason, but the most important.) What I mean > is: if we dropped 2.7 support tomorrow then it wouldn't actually make > numpy unavailable on python 2.7; it would just mean that lots of users > stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the > end of the world ? numpy is mature software and 1.12 works pretty > well. The big problem IMO would be if this then meant that lots of > downstream projects felt that they had to continue supporting 1.12 > going forward, which makes it very difficult for us to effectively > ship new features or even bug fixes ? I mean, we can ship them, but > no-one will use them. 
And if a downstream project finds a bug in numpy > and can't upgrade numpy, then the tendency is to work around it > instead of reporting it upstream. I think this is the main thing we > want to avoid. > +1 > > This kind of means that we're at the mercy of downstream projects, > though ? if scipy/pandas/etc. decide they want to support 2.7 until > 2022, it might be in our best interest to do the same. But there's a > collective action problem here: we want to keep supporting 2.7 so long > as they do, but at the same time they may feel they need to keep > supporting 2.7 as long as we do. And all of us would prefer to drop > 2.7 support sooner rather than later, but we might all get stuck > because we're waiting for someone else to move first. > I don't quite agree about being stuck. These kind of upgrades should and usually do go top of stack to bottom. Something like Jupyter which is mostly an end user tool goes first (they announced 2020 quite a while ago), domain specific packages go at a similar time, then scipy & co, and only after that numpy. Cython will be even later I'm sure - it still supports Python 2.6. > > So my suggestion would be that numpy make some official announcement > that our plan is to drop support for python 2 immediately after > cpython upstream does. Not quite sure CPython schedule is relevant - important bug fixes haven't been making it into 2.7 for a very long time now, so the only change is the rare security patch. > If worst comes to worst we can always decide to > extend it at the time... but if we make the announcement now, then > it's less likely that we'll need to :-). > I'd be in favor of putting out a schedule in coordination with scipy/pandas/etc, but it probably should look more like - 2020: what's on http://www.python3statement.org/ now - 2021: scipy / pandas / scikit-learn / etc. - 2022: numpy Ralf > Another interesting project to look at here is django, since they > occupy a similar place in the ecosystem (e.g. 
last I checked numpy and > django are the two most-imported python packages on github): > https://www.djangoproject.com/weblog/2015/jun/25/roadmap/ > Their approach isn't directly applicable, because unlike us they have > a strict time-based release schedule, defined support period for each > release, and a distinction between regular and long-term support > releases, where regular releases act sort of like > pre-releases-on-steroids for the next LTS release. But basically what > they settled on is philosophically similar to what I'm suggesting: > they don't want an LTS to be supporting 2.7 beyond when cpython is > supporting it. Then on top of that they don't want to support 2.7 in > the regular releases leading up to that LTS either, so the net effect > is that their last release with 2.7 support came out last week, and it > will be supported until 2020 :-). And another useful precedent I think > is that they announced this two years ago, back in 2015; if we make an > announcement now, we'll be be giving a similar amount of warning. > > -n > > -- > Nathaniel J. Smith -- https://vorpus.org > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Apr 15 03:02:42 2017 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 15 Apr 2017 00:02:42 -0700 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Fri, Apr 14, 2017 at 10:47 PM, Ralf Gommers wrote: > > > On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote: [...] >> From numpy's perspective, I feel like the most important reason to >> continue supporting 2.7 is our ability to convince people to keep >> upgrading. (Not the only reason, but the most important.) 
What I mean >> is: if we dropped 2.7 support tomorrow then it wouldn't actually make >> numpy unavailable on python 2.7; it would just mean that lots of users >> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the >> end of the world ? numpy is mature software and 1.12 works pretty >> well. The big problem IMO would be if this then meant that lots of >> downstream projects felt that they had to continue supporting 1.12 >> going forward, which makes it very difficult for us to effectively >> ship new features or even bug fixes ? I mean, we can ship them, but >> no-one will use them. And if a downstream project finds a bug in numpy >> and can't upgrade numpy, then the tendency is to work around it >> instead of reporting it upstream. I think this is the main thing we >> want to avoid. > > > +1 > >> >> >> This kind of means that we're at the mercy of downstream projects, >> though ? if scipy/pandas/etc. decide they want to support 2.7 until >> 2022, it might be in our best interest to do the same. But there's a >> collective action problem here: we want to keep supporting 2.7 so long >> as they do, but at the same time they may feel they need to keep >> supporting 2.7 as long as we do. And all of us would prefer to drop >> 2.7 support sooner rather than later, but we might all get stuck >> >> because we're waiting for someone else to move first. > > > I don't quite agree about being stuck. These kind of upgrades should and > usually do go top of stack to bottom. Something like Jupyter which is mostly > an end user tool goes first (they announced 2020 quite a while ago), domain > specific packages go at a similar time, then scipy & co, and only after that > numpy. Cython will be even later I'm sure - it still supports Python 2.6. To make sure we're on the same page about what "2020" means here: the latest release of IPython is 5.0, which came out in July last year. 
This is the last release that supports py2; they dropped support for py2 in master months ago, and 6.0 (whose schedule has been slipping, but I think should be out Any Time Now?) won't support py2. Their plan is to keep backporting bug fixes to 5.x until the end of 2017; after that the core team won't support py2 at all. And they've also announced that if volunteers want to step up to maintain 5.x after that, then they're willing to keep accepting pull requests until July 2019.

Refs:
https://blog.jupyter.org/2016/07/08/ipython-5-0-released/
https://github.com/jupyter/roadmap/blob/master/accepted/migration-to-python-3-only.md

I suspect that in practice that "end of 2017" date will be the end-of-support date for most intents and purposes. And for numpy, with its vaguely defined support periods, I think it makes most sense to talk in terms of release dates; so if we want to compare apples-to-apples, my suggestion is that numpy drops py2 support in 2020, and in that sense ipython dropped py2 support in July last year.

>>
>> So my suggestion would be that numpy make some official announcement
>> that our plan is to drop support for python 2 immediately after
>> cpython upstream does.
>
>
> Not quite sure CPython schedule is relevant - important bug fixes haven't
> been making it into 2.7 for a very long time now, so the only change is the
> rare security patch.

Huh? 2.7 gets tons of changes: https://github.com/python/cpython/commits/2.7

Officially CPython has 2 modes for releases: "regular support" and "security fixes only". 2.7 is special - it gets regular support, and then on top of that it also has a special exception to allow certain kinds of major changes, like the ssl module backports. If you know of important bug fixes that they're missing, then I think they'd like to know :-). Anyway, the reason the CPython schedule is relevant is that once they drop support, it *will* stop getting security patches, so it will become increasingly impossible to use safely.
>> >> If worst comes to worst we can always decide to >> extend it at the time... but if we make the announcement now, then >> it's less likely that we'll need to :-). > > > I'd be in favor of putting out a schedule in coordination with > scipy/pandas/etc, but it probably should look more like > - 2020: what's on http://www.python3statement.org/ now > - 2021: scipy / pandas / scikit-learn / etc. Um... pandas is already on python3statement.org right now :-) > - 2022: numpy Honestly I don't see why we should plan to support python 2 a day longer than our major downstream dependencies. That was the point of my first paragraph: for us the main benefit to supporting 2 is to avoid forcing our downstream dependencies to pin an old numpy. What's that extra year get us if they've already moved on? The other odd thing about this schedule is that you're suggesting that the organizing principle should be that the stack switches from top-of-stack to bottom... but then you left out the bottom of the stack! :-) - 2020: python -n -- Nathaniel J. Smith -- https://vorpus.org From jtaylor.debian at googlemail.com Sat Apr 15 04:47:34 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 10:47:34 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> On 15.04.2017 02:19, Charles R Harris wrote: > Hi All, > > It may be early to discuss dropping support for Python 2.7, but there is > a disturbance in the force that suggests that it might be worth looking > forward to the year 2020 when Python itself will drop support for 2.7. > There is also a website, http://www.python3statement.org > , where several projects in the > scientific python stack have pledged to be Python 2.7 free by that > date. Given that, a preliminary discussion of the subject might be > interesting, if only to gather information of where the community > currently stands. 
> > Chuck > > I am very against planning to drop it. Numpy is the lowest part of the scipy stack so it is not our decision to do so and we don't gain that much by doing so. Lets discuss this in 3 years or when the distributions kick out python2.7 (which won't happen before ~2022). There is no point doing so now. Also PyPy does not plan on dropping 2.7 by that time. Also before we even consider this we need to fix our python3 support. This means getting the IO functions (https://github.com/numpy/numpy/pull/4208) in order and adding a string type that people are less reluctant to use than the 4 byte unicode we currently offer. From perimosocordiae at gmail.com Sat Apr 15 08:49:18 2017 From: perimosocordiae at gmail.com (CJ Carey) Date: Sat, 15 Apr 2017 08:49:18 -0400 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> Message-ID: What do we think about the trade-offs of having a shared 2.7/3.x codebase going forward? As Python3 adds more nontrivial features, keeping compatibility with 2.7 becomes more burdensome. Will there be a separate py2-numpy branch/repo at some point before ending support? On Apr 15, 2017 4:48 AM, "Julian Taylor" wrote: > On 15.04.2017 02:19, Charles R Harris wrote: > > Hi All, > > > > It may be early to discuss dropping support for Python 2.7, but there is > > a disturbance in the force that suggests that it might be worth looking > > forward to the year 2020 when Python itself will drop support for 2.7. > > There is also a website, http://www.python3statement.org > > , where several projects in the > > scientific python stack have pledged to be Python 2.7 free by that > > date. Given that, a preliminary discussion of the subject might be > > interesting, if only to gather information of where the community > > currently stands. > > > > Chuck > > > > > > I am very against planning to drop it. 
> Numpy is the lowest part of the scipy stack so it is not our decision to > do so and we don't gain that much by doing so. > Lets discuss this in 3 years or when the distributions kick out > python2.7 (which won't happen before ~2022). There is no point doing so > now. > Also PyPy does not plan on dropping 2.7 by that time. > > Also before we even consider this we need to fix our python3 support. > This means getting the IO functions > (https://github.com/numpy/numpy/pull/4208) in order and adding a string > type that people are less reluctant to use than the 4 byte unicode we > currently offer. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sat Apr 15 10:17:01 2017 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sat, 15 Apr 2017 10:17:01 -0400 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> Message-ID: Hi All, I think Nathaniel had a good summary. My own 2? are mostly about the burden of supporting python2. I have only recently attempted to make changes in the C codebase of numpy and one of the reasons I found this more than a little daunting is the complex web of include files. In this respect, the python3/2 split is certainly not the biggest hindrance, but it was also not particularly helpful for understanding to have "translations" of python2 macros to python3 equivalents in npy_3kcompat.h: for newcomers, it would seem helpful if they could read the Python3 C-API and be able to understand what is going on. Of course, the above also proves Julian's point: for strings in particular, numpy still has a bit to go to be fully python3-ized. 
Finally, as for pypy: they just made a huge effort to become compatible with python3; is their plan really to stick with python2 much beyond 2020? All the best, Marten From jtaylor.debian at googlemail.com Sat Apr 15 10:30:18 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 16:30:18 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> Message-ID: <6d1b2f13-5647-84b4-d7f3-99af2b786826@googlemail.com> On 15.04.2017 16:17, Marten van Kerkwijk wrote: > Hi All, > > I think Nathaniel had a good summary. My own 2? are mostly about the > burden of supporting python2. I have only recently attempted to make > changes in the C codebase of numpy and one of the reasons I found this > more than a little daunting is the complex web of include files. In > this respect, the python3/2 split is certainly not the biggest > hindrance, but it was also not particularly helpful for understanding > to have "translations" of python2 macros to python3 equivalents in > npy_3kcompat.h: for newcomers, it would seem helpful if they could > read the Python3 C-API and be able to understand what is going on. > > Of course, the above also proves Julian's point: for strings in > particular, numpy still has a bit to go to be fully python3-ized. > > Finally, as for pypy: they just made a huge effort to become > compatible with python3; is their plan really to stick with python2 > much beyond 2020? > http://doc.pypy.org/en/latest/faq.html#how-long-will-pypy-support-python2 According to that Python2 support will be available as long as PyPy itself exists. 
From jtaylor.debian at googlemail.com Sat Apr 15 10:33:45 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 16:33:45 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: <6d1b2f13-5647-84b4-d7f3-99af2b786826@googlemail.com> References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> <6d1b2f13-5647-84b4-d7f3-99af2b786826@googlemail.com> Message-ID: <0591969a-3507-efbd-8017-9361724bb22a@googlemail.com> On 15.04.2017 16:30, Julian Taylor wrote: > On 15.04.2017 16:17, Marten van Kerkwijk wrote: >> Hi All, >> >> I think Nathaniel had a good summary. My own 2? are mostly about the >> burden of supporting python2. I have only recently attempted to make >> changes in the C codebase of numpy and one of the reasons I found this >> more than a little daunting is the complex web of include files. In >> this respect, the python3/2 split is certainly not the biggest >> hindrance, but it was also not particularly helpful for understanding >> to have "translations" of python2 macros to python3 equivalents in >> npy_3kcompat.h: for newcomers, it would seem helpful if they could >> read the Python3 C-API and be able to understand what is going on. >> >> Of course, the above also proves Julian's point: for strings in >> particular, numpy still has a bit to go to be fully python3-ized. >> >> Finally, as for pypy: they just made a huge effort to become >> compatible with python3; is their plan really to stick with python2 >> much beyond 2020? >> > > http://doc.pypy.org/en/latest/faq.html#how-long-will-pypy-support-python2 > > According to that Python2 support will be available as long as PyPy > itself exists. > Of course they don't support the stdlib itself, so this doesn't actually mean much depending on how the much community will care about fixing security issues in the python2 stdlib. But at least there might be a place where patches can get accepted and released. 
From charlesr.harris at gmail.com Sat Apr 15 11:44:43 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 15 Apr 2017 09:44:43 -0600 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Fri, Apr 14, 2017 at 11:47 PM, Ralf Gommers wrote: > > > On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote: > >> On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris >> wrote: >> > Hi All, >> > >> > It may be early to discuss dropping support for Python 2.7, but there >> is a >> > disturbance in the force that suggests that it might be worth looking >> > forward to the year 2020 when Python itself will drop support for 2.7. >> There >> > is also a website, http://www.python3statement.org, where several >> projects >> > in the scientific python stack have pledged to be Python 2.7 free by >> that >> > date. Given that, a preliminary discussion of the subject might be >> > interesting, if only to gather information of where the community >> currently >> > stands. >> >> One reasonable position would that numpy releases that happen while >> 2.7 is supported upstream will also support 2.7, and releases after >> that won't. >> >> From numpy's perspective, I feel like the most important reason to >> continue supporting 2.7 is our ability to convince people to keep >> upgrading. (Not the only reason, but the most important.) What I mean >> is: if we dropped 2.7 support tomorrow then it wouldn't actually make >> numpy unavailable on python 2.7; it would just mean that lots of users >> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the >> end of the world ? numpy is mature software and 1.12 works pretty >> well. The big problem IMO would be if this then meant that lots of >> downstream projects felt that they had to continue supporting 1.12 >> going forward, which makes it very difficult for us to effectively >> ship new features or even bug fixes ? 
I mean, we can ship them, but >> no-one will use them. And if a downstream project finds a bug in numpy >> and can't upgrade numpy, then the tendency is to work around it >> instead of reporting it upstream. I think this is the main thing we >> want to avoid. >> > > +1 > > >> >> This kind of means that we're at the mercy of downstream projects, >> though ? if scipy/pandas/etc. decide they want to support 2.7 until >> 2022, it might be in our best interest to do the same. But there's a >> collective action problem here: we want to keep supporting 2.7 so long >> as they do, but at the same time they may feel they need to keep >> supporting 2.7 as long as we do. And all of us would prefer to drop >> 2.7 support sooner rather than later, but we might all get stuck >> > because we're waiting for someone else to move first. >> > > I don't quite agree about being stuck. These kind of upgrades should and > usually do go top of stack to bottom. Something like Jupyter which is > mostly an end user tool goes first (they announced 2020 quite a while ago), > domain specific packages go at a similar time, then scipy & co, and only > after that numpy. Cython will be even later I'm sure - it still supports > Python 2.6. > > >> >> So my suggestion would be that numpy make some official announcement >> that our plan is to drop support for python 2 immediately after >> cpython upstream does. > > > Not quite sure CPython schedule is relevant - important bug fixes haven't > been making it into 2.7 for a very long time now, so the only change is the > rare security patch. > > >> If worst comes to worst we can always decide to >> extend it at the time... but if we make the announcement now, then >> it's less likely that we'll need to :-). >> > > I'd be in favor of putting out a schedule in coordination with > scipy/pandas/etc, but it probably should look more like > - 2020: what's on http://www.python3statement.org/ now > - 2021: scipy / pandas / scikit-learn / etc. 
> - 2022: numpy > > I think things will move faster than one might think. In any case, we are probably about 5 releases away from 2020. As Nathaniel points out, numpy is mature and 1.12 is pretty good already, so hopefully 1.17 would be even better. I think dropping Python 2.7 support at that point would not cause much in the way of problems as 1.17 should be good for a number of years after that and would be easily installed from PyPI. A bigger driver long term might be uptake by distros, although the impact of that might be harder to estimate. I suspect it will affect developers more than end users, who will more likely be using Anaconda, Canopy, or similar to manage their development environment. Another thing to consider is that future developers will likely have less and less experience with Python 2.7 as teaching and classroom use moves to 3. Whatever we decide, I think Nathaniel's point about making an early announcement is a good one, as is Julian's comment about bringing Numpy into full support of Python 3. We need to put together a plan with at least a tentative schedule that will help get downstream projects thinking about their own plans and engender more feedback. It might be useful to have a BOF(s) at SciPy 2017 where the issue can be discussed with a broader range of people. Chuck > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Sat Apr 15 12:32:44 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 18:32:44 +0200 Subject: [Numpy-discussion] testing needed for f2py with char/string arrays Message-ID: <7fc9930b-2711-8503-78b2-5e90ad773017@googlemail.com> hi, we need to deprecate the NPY_CHAR typenumber [0] in order to enable us to add new core dtypes without adding ugly hacks to our ABI. Technically the typenumber was deprecated way back in 1.6 when it accidentally broke our ABI. 
But due to lack of time, f2py never got updated to actually follow through. In order to unblock our dtype development cleanly, we want to finally do the deprecation properly. As nobody really knows how f2py works, and there are no existing unit tests covering the char dtype, the change is very likely to break something.

The change is available here: https://github.com/numpy/numpy/pull/8948

It attempts to map the NPY_CHAR dtype to the equivalent NPY_STRING with itemsize 1. I have only been able to come up with a test that covers one of the changed places. So if you have an f2py use case that in some way involves passing arrays of strings back and forth between python and fortran, please test that branch or post a reproducible example here.

Thanks,
Julian

[0] https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/ndarraytypes.h#L74

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 845 bytes
Desc: OpenPGP digital signature
URL: 

From ralf.gommers at gmail.com Sat Apr 15 17:20:29 2017
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Sun, 16 Apr 2017 09:20:29 +1200
Subject: [Numpy-discussion] Long term plans for dropping Python 2.7
In-Reply-To: 
References: 
Message-ID: 

On Sat, Apr 15, 2017 at 7:02 PM, Nathaniel Smith wrote:
> On Fri, Apr 14, 2017 at 10:47 PM, Ralf Gommers wrote:
> >
> > On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote:
> [...]
> >> From numpy's perspective, I feel like the most important reason to
> >> continue supporting 2.7 is our ability to convince people to keep
> >> upgrading. (Not the only reason, but the most important.) What I mean
> >> is: if we dropped 2.7 support tomorrow then it wouldn't actually make
> >> numpy unavailable on python 2.7; it would just mean that lots of users
> >> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the
> >> end of the world - numpy is mature software and 1.12 works pretty
> >> well.
The big problem IMO would be if this then meant that lots of > >> downstream projects felt that they had to continue supporting 1.12 > >> going forward, which makes it very difficult for us to effectively > >> ship new features or even bug fixes -- I mean, we can ship them, but > >> no-one will use them. And if a downstream project finds a bug in numpy > >> and can't upgrade numpy, then the tendency is to work around it > >> instead of reporting it upstream. I think this is the main thing we > >> want to avoid. > > > > > > +1 > > > >> > >> > >> This kind of means that we're at the mercy of downstream projects, > >> though -- if scipy/pandas/etc. decide they want to support 2.7 until > >> 2022, it might be in our best interest to do the same. But there's a > >> collective action problem here: we want to keep supporting 2.7 so long > >> as they do, but at the same time they may feel they need to keep > >> supporting 2.7 as long as we do. And all of us would prefer to drop > >> 2.7 support sooner rather than later, but we might all get stuck > >> > >> because we're waiting for someone else to move first. > > > > > > I don't quite agree about being stuck. These kinds of upgrades should and > > usually do go top of stack to bottom. Something like Jupyter which is > mostly > > an end user tool goes first (they announced 2020 quite a while ago), > domain > > specific packages go at a similar time, then scipy & co, and only after > that > > numpy. Cython will be even later I'm sure - it still supports Python 2.6. > > To make sure we're on the same page about what "2020" means here: the > latest release of IPython is 5.0, which came out in July last year. > This is the last release that supports py2; they dropped support for > py2 in master months ago, and 6.0 (whose schedule has been slipping, > but I think should be out Any Time Now?) won't support py2. 
Their plan > is to keep backporting bug fixes to 5.x until the end of 2017; after > that the core team won't support py2 at all. And they've also > announced that if volunteers want to step up to maintain 5.x after > that, then they're willing to keep accepting pull requests until July > 2019. > > Refs: > https://blog.jupyter.org/2016/07/08/ipython-5-0-released/ > https://github.com/jupyter/roadmap/blob/master/accepted/migration-to-python-3-only.md > > I suspect that in practice that "end of 2017" date will be the > end-of-support date for most intents and purposes. And for numpy with > its vaguely defined support periods, I think it makes most sense to > talk in terms of release dates; agreed, release dates make sense, we don't want to be doing some kind of LTS scheme. > so if we want to compare > apples-to-apples, my suggestion is that numpy drops py2 support in > 2020 and in that sense ipython dropped py2 support in July last year. > > >> > >> So my suggestion would be that numpy make some official announcement > >> that our plan is to drop support for python 2 immediately after > >> cpython upstream does. > > > > > > Not quite sure CPython schedule is relevant - important bug fixes haven't > > been making it into 2.7 for a very long time now, so the only change is > the > > rare security patch. > > Huh? 2.7 gets tons of changes: https://github.com/python/cpython/commits/2.7 You're right. My experience is ending up on bugs.python.org when debugging and the answer to "can this be backported to 2.7" usually being no - but it looks like my experience is skewed by distutils, which is not exactly well maintained. > Officially CPython has 2 modes for releases: "regular support" and > "security fixes only". 2.7 is special -- it gets regular support, and > then on top of that it also has a special exception to allow certain > kinds of major changes, like the ssl module backports. 
> If you know of important bug fixes that they're missing then I think > they'd like to know :-). > Anyway, the reason the CPython schedule is relevant is that once they > drop support, it *will* stop getting security patches, so it will > become increasingly impossible to use safely. > For web stuff yes, but not all that relevant for scientific work. > > >> > >> If worst comes to worst we can always decide to > >> extend it at the time... but if we make the announcement now, then > >> it's less likely that we'll need to :-). > > > > > > I'd be in favor of putting out a schedule in coordination with > > scipy/pandas/etc, but it probably should look more like > > - 2020: what's on http://www.python3statement.org/ now > > - 2021: scipy / pandas / scikit-learn / etc. > > Um... pandas is already on python3statement.org right now :-) > > > - 2022: numpy > > Honestly I don't see why we should plan to support python 2 a day > longer than our major downstream dependencies. That was the point of > my first paragraph: for us the main benefit to supporting 2 is to > avoid forcing our downstream dependencies to pin an old numpy. What's > that extra year get us if they've already moved on? > > The other odd thing about this schedule is that you're suggesting that > the organizing principle should be that the stack switches from > top-of-stack to bottom... but then you left out the bottom of the > stack! :-) > I don't think of Python as part of the stack, because it's not upgradeable for most users (except for with conda). It's more like having a base platform (OS + compilers + Python version) on which you install a scientific stack which has numpy as its lowest level component. Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From antoine at python.org Sun Apr 16 04:39:27 2017 From: antoine at python.org (Antoine Pitrou) Date: Sun, 16 Apr 2017 10:39:27 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Fri, 14 Apr 2017 22:19:42 -0700 Nathaniel Smith wrote: > > From numpy's perspective, I feel like the most important reason to > continue supporting 2.7 is our ability to convince people to keep > upgrading. (Not the only reason, but the most important.) What I mean > is: if we dropped 2.7 support tomorrow then it wouldn't actually make > numpy unavailable on python 2.7; it would just mean that lots of users > stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the > end of the world -- numpy is mature software and 1.12 works pretty > well. The big problem IMO would be if this then meant that lots of > downstream projects felt that they had to continue supporting 1.12 > going forward, which makes it very difficult for us to effectively > ship new features or even bug fixes -- I mean, we can ship them, but > no-one will use them. Everyone using Python 3, which is a large and growing number of people, will be able to use the new features. I think the model you've outlined above -- a kind of "LTS" Numpy version that supports 2.7 (with some amount of maintenance going on, at least to fix important bugs), and later feature releases being 3.x-only, is the right way forward. It will lighten maintenance of later versions, allow the Numpy codebase to use modern Python idioms and stdlib features, and will leave 2.x maintenance to people who really care about it. You may already have heard of it, but Django 1.11, which was just released, is the last feature release to support Python 2. Further feature releases of Django will only support Python 3. https://docs.djangoproject.com/en/1.11/releases/1.11/ Regards Antoine. 
From charlesr.harris at gmail.com Wed Apr 19 14:28:32 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 19 Apr 2017 12:28:32 -0600 Subject: [Numpy-discussion] Relaxed stride checking fixup Message-ID: Hi All, Currently numpy master has a bogus stride that will cause an error when downstream projects misuse it. That is done in order to help smoke out errors. Previously that bogus stride has been fixed up for releases, but that requires a special patch to be applied after each version branch is made. At this point I'd like to pick one or the other option and make the development and release branches the same in this regard. The question is: which option to choose? Keeping the fixup in master will remove some code and keep things simple, while not fixing up the release will possibly lead to more folks finding errors. At this point in time I am favoring applying the fixup in master. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Thu Apr 20 06:21:26 2017 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Thu, 20 Apr 2017 22:21:26 +1200 Subject: [Numpy-discussion] Relaxed stride checking fixup In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 6:28 AM, Charles R Harris wrote: > Hi All, > > Currently numpy master has a bogus stride that will cause an error when > downstream projects misuse it. That is done in order to help smoke out > errors. Previously that bogus stride has been fixed up for releases, but > that requires a special patch to be applied after each version branch is > made. At this point I'd like to pick one or the other option and make the > development and release branches the same in this regard. The question is: > which option to choose? Keeping the fixup in master will remove some code > and keep things simple, while not fixing up the release will possibly lead > to more folks finding errors. 
At this point in time I am favoring applying > the fixup in master. > > Thoughts? > If we have to pick then keeping the fixup sounds reasonable. Would there be value in making the behavior configurable at compile time? If there are more such things and they'd be behind a __NUMPY_DEBUG__ switch, then people may want to test that in their own CI. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Apr 20 09:15:27 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 15:15:27 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays Message-ID: Hello, As you probably know numpy does not deal well with strings in Python3. The np.string type is actually zero terminated bytes and not a string. In Python2 this happened to work out as it treats bytes and strings the same way. But in Python3 this type is pretty hard to work with as each time you get an item from a numpy bytes array it needs decoding to receive a string. The only string type available in Python3 is np.unicode which uses 4-byte utf-32 encoding which is deemed to use too much memory to actually see much use. What people apparently want is a string type for Python3 which uses less memory for the common science use case which rarely needs more than latin1 encoding. As we have been told we cannot change the np.string type to actually be strings as existing programs do interpret its content as bytes despite this being very broken due to its null terminating property (it will ignore all trailing nulls). Also 8 years of working around numpy's poor python3 support decisions in third parties probably make the 'return bytes' behaviour impossible to change now. So we need a new dtype that can represent strings in numpy arrays which is smaller than the existing 4 byte utf-32. 
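(The pain points described above are easy to demonstrate; a minimal sketch, assuming only Python 3 and a current numpy:)

```python
import numpy as np

# np.unicode/str arrays store UTF-32: four bytes per character.
u = np.array(['hello'])
print(u.dtype, u.dtype.itemsize)   # '<U5', 20 bytes per element

# Bytes ("string") arrays are compact but hand back bytes, not str,
# and silently drop trailing nulls.
b = np.array([b'ab\x00\x00'], dtype='S4')
print(b[0])                        # b'ab' -- trailing nulls are gone
print(b[0].decode('latin1'))       # explicit decode needed to get a str
```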
To please everyone I think we need to go with a dtype that supports multiple encodings via metadata, similar to how datetime supports multiple units. E.g.: 'U10[latin1]' is 10 characters in latin1 encoding.

Encodings we should support are:
- latin1 (1 byte): it is compatible with ascii and adds extra characters used in the western world.
- utf-32 (4 bytes): can represent every character, equivalent with np.unicode

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes): this covers a very large range of possible characters in a reasonably compact representation
- utf-8 (4 bytes): variable length encoding with minimum size of 1 byte, but we would need to assume the worst case of 4 bytes so it would not save anything compared to utf-32 but may allow third parties to replace an encoding step with trailing null trimming on serialization.

To actually do this we have two options, both of which break our ABI when done without ugly hacks.

- Add a new dtype, e.g. npy.realstring. By not modifying an existing type, we only break programs using NPY_CHAR. The most notable case of this is f2py. It has the cosmetic disadvantage that it makes the np.unicode dtype obsolete and is more busywork to implement.
- Modify np.unicode to have encoding metadata. This allows us to reuse all the type boilerplate so it is more convenient to implement, and by extending an existing type instead of making one obsolete it results in a much nicer API. The big drawback is that it will explicitly break any third party that receives an array with a new encoding and assumes that the buffer of an array of type np.unicode will have a character itemsize of 4 bytes. To ease this problem we would need to add APIs to get the itemsize and encoding to numpy now so third parties can error out cleanly.

The implementation of it is not that big a deal, I have already created a prototype for adding latin1 metadata to np.unicode which works quite well. 
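(The datetime64 precedent already shows what bracketed per-dtype metadata looks like, and the size argument for latin1 is easy to check with Python's codecs; note the 'U10[latin1]' spelling itself is only proposed here and does not exist in numpy:)

```python
import numpy as np

# Existing precedent: datetime64 carries unit metadata in brackets,
# and dtypes with different units compare unequal.
ms = np.dtype('datetime64[ms]')
s = np.dtype('datetime64[s]')
print(ms, s, ms == s)

# The size argument for a latin1-backed dtype: one byte per character
# versus the four used by the UTF-32 np.unicode storage.
text = 'café'
print(len(text.encode('latin-1')))    # 4 bytes
print(len(text.encode('utf-32-le')))  # 16 bytes
```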
It is imo realistic to get this into 1.14 should we be able to make a decision on which way to implement it. Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode? cheers, Julian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From peridot.faceted at gmail.com Thu Apr 20 12:47:09 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Thu, 20 Apr 2017 16:47:09 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor wrote: > To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datatime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > > Encodings we should support are: > - latin1 (1 bytes): > it is compatible with ascii and adds extra characters used in the > western world. > - utf-32 (4 bytes): > can represent every character, equivalent with np.unicode > > Encodings we should maybe support: > - utf-16 with explicitly disallowing surrogate pairs (2 bytes): > this covers a very large range of possible characters in a reasonably > compact representation > - utf-8 (4 bytes): > variable length encoding with minimum size of 1 bytes, but we would need > to assume the worst case of 4 bytes so it would not save anything > compared to utf-32 but may allow third parties replace an encoding step > with trailing null trimming on serialization. > I should say first that I've never used even non-Unicode string arrays, but is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand. 
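(Python already exposes the codec registry Anne refers to, with name normalization and aliases a dtype could in principle reuse; a sketch:)

```python
import codecs

# codecs.lookup() normalizes spellings and resolves aliases.
print(codecs.lookup('UTF8').name)      # 'utf-8'
print(codecs.lookup('latin-1').name)   # 'iso8859-1'

# Every registered codec provides encode/decode pairs a dtype could call.
enc = codecs.lookup('latin-1').encode
data, length = enc('numpy')
print(data, length)
```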
Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.) Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit. All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.) Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:06:31 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:06:31 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: Thanks so much for reviving this conversation -- we really do need to address this. My thoughts: What people apparently want is a string type for Python3 which uses less > memory for the common science use case which rarely needs more than > latin1 encoding. 
> Yes -- I think there is a real demand for that. To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datetime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > I wonder if we really need that -- as you say, there is real demand for a compact string type, but for many use cases, 1 byte per character is enough. So to keep things really simple, I think a single 1-byte per char encoding would meet most people's needs. What should that encoding be? latin-1 is obvious (and has the very nice property of being able to round-trip arbitrary bytes -- at least with Python's implementation) and scientific data sets tend to use the latin alphabet (with its ascii roots and all). But there is now latin-9: https://en.wikipedia.org/wiki/ISO/IEC_8859-15 Maybe a better option? Encodings we should support are: > - latin1 (1 byte): > it is compatible with ascii and adds extra characters used in the > western world. > - utf-32 (4 bytes): > can represent every character, equivalent with np.unicode > IIUC, datetime64 is, well, always 64 bits. So it may be better to have a given dtype always be the same bitwidth. So the utf-32 dtype would be a different dtype. Which also keeps it really simple: we have a latin-* dtype and a full-on unicode dtype -- that's it. Encodings we should maybe support: > - utf-16 with explicitly disallowing surrogate pairs (2 bytes): > this covers a very large range of possible characters in a reasonably > compact representation > I think UTF-16 is, very simply, the worst of both worlds. If we want a two-byte character set, then it should be UCS-2 -- i.e. explicitly rejecting any code point that takes more than two bytes to represent. (or maybe that's what you mean by explicitly disallowing surrogate pairs). 
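(For reference, the surrogate-pair issue is easy to see from Python; the `fits_ucs2` helper below is made up for illustration -- a strict UCS-2 rule is just "every code point fits in 16 bits":)

```python
# BMP characters take two bytes in UTF-16...
print(len('a'.encode('utf-16-le')))            # 2

# ...but a non-BMP character (here an emoji) needs a surrogate pair.
print(len('\U0001F600'.encode('utf-16-le')))   # 4

def fits_ucs2(s):
    """Strict UCS-2 check: every code point must fit in 16 bits."""
    return all(ord(ch) <= 0xFFFF for ch in s)

print(fits_ucs2('naïve'), fits_ucs2('\U0001F600'))  # True False
```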
in any case, it should certainly give you an encoding error if you try to pass in a unicode character that cannot fit into two bytes. So: is there actually a demand for this? If so, then I think it should be a separate 2-byte string type, with the encoding always the same. > - utf-8 (4 bytes): > variable length encoding with minimum size of 1 byte, but we would need > to assume the worst case of 4 bytes so it would not save anything > compared to utf-32 but may allow third parties to replace an encoding step > with trailing null trimming on serialization. > yeach -- utf-8 is great for interchange and streaming data, but not for internal storage, particularly with numpy's every-item-has-the-same-number-of-bytes requirement. So if someone wants to work with utf-8 they can store it in a byte array, and encode and decode as they pass it to/from python. That's going to have to happen anyway, even if under the hood. And it's risky business -- if you truncate a utf-8 bytestring, you may get invalid data -- it really does not belong in numpy. > - Add a new dtype, e.g. npy.realstring > I think that's the way to go. Backwards compatibility is really key. Though could we make the existing string dtype a latin-1 always type without breaking too much? Or maybe deprecate and get there in the future? It has the cosmetic disadvantage that it makes the np.unicode dtype > obsolete and is more busywork to implement. > I think the np.unicode type should remain as the 4-bytes per char encoding. But that only makes sense if you follow my idea that we don't have a variable number of bytes per char dtype. So my proposal is:
- Create a new one-byte-per-char dtype that is always latin-9 encoded.
  - in python3 it would map to a string (i.e. unicode)
- Keep the 4-byte per char unicode string type
Optionally (if there is really demand):
- Create a new two-byte per char dtype that is always UCS-2 encoded.
Is there any way to leverage Python3's nifty string type? I'm thinking not. 
At least not for numpy arrays that can play well with C code, etc. All that being said, an encoding-specified string dtype would be nice too -- I just think it's more complex than it needs to be. Numpy is not the tool for text processing... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Apr 20 13:26:13 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 10:26:13 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing. On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Variable-length encodings, of which UTF-8 is obviously the one that makes > good handling essential, are indeed more complicated. But is it strictly > necessary that string arrays hold fixed-length *strings*, or can the > encoding length be fixed instead? That is, currently if you try to assign a > longer string than will fit, the string is truncated to the number of > characters in the data type. Instead, for encoded Unicode, the string could > be truncated so that the encoding fits. Of course this is not completely > trivial for variable-length encodings, but it should be doable, and it > would allow UTF-8 to be used just the way it usually is - as an encoding > that's almost 8-bit. > I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. 
Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters. In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype. The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:28:14 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:28:14 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Is there any reason not to support all Unicode encodings that python does, > with the same names and semantics? This would surely be the simplest to > understand. > I think it should support all fixed-length encodings, but not the non-fixed length ones -- they just don't fit well into the numpy data model. > Also, if latin1 is to going to be the only practical 8-bit encoding, maybe > check with some non-Western users to make sure it's not going to wreck > their lives? I'd have selected ASCII as an encoding to treat specially, if > any, because Unicode already does that and the consequences are familiar. > (I'm used to writing and reading French without accents because it's passed > through ASCII, for example.) > latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for the European language and some symbols that are nice to have (I use the degree symbol a lot...). 
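(The practical latin-1 vs latin-9 difference is concrete and easy to check with Python's codecs -- latin-9 is registered as 'iso-8859-15'; a sketch:)

```python
# The degree symbol is covered by latin-1 (and latin-9).
print('°'.encode('latin-1'))        # b'\xb0'

# The euro sign only exists in latin-9 (ISO 8859-15), at 0xA4.
print('€'.encode('iso-8859-15'))    # b'\xa4'
try:
    '€'.encode('latin-1')
except UnicodeEncodeError:
    print('not representable in latin-1')
```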
And it is ASCII compatible -- so there is NO reason to choose ASCII over Latin-* Which does no good for non-latin languages -- so we need to hear from the community -- is there a substantial demand for a non-latin one-byte per character encoding? > Variable-length encodings, of which UTF-8 is obviously the one that makes > good handling essential, are indeed more complicated. But is it strictly > necessary that string arrays hold fixed-length *strings*, or can the > encoding length be fixed instead? That is, currently if you try to assign a > longer string than will fit, the string is truncated to the number of > characters in the data type. > we could do that, yes, but an improperly truncated "string" becomes invalid -- just seems like a recipe for bugs that won't be found in testing. Memory is cheap, compression is fast -- we really shouldn't get hung up on this! Note: if you are storing a LOT of text (for which I have no idea why you would use numpy anyway), then the memory size might matter, but then semi-arbitrary truncation would probably matter, too. I expect most text storage in numpy arrays is things like names of datasets, ids, etc, etc -- not massive amounts of text -- so storage space really isn't critical, but having an id or something unexpectedly truncated could be bad. I think practical experience has shown us that people do not handle "mostly fixed length but once in awhile not" text well -- see the nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use utf-8 if all your data are in ascii???). but still -- why invite hard-to-test-for errors? Final point -- as Julian suggests, one reason to support utf-8 is for interoperability with other systems -- but that makes errors more of an issue -- if it doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array. 
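(The invalid-data failure mode is a one-liner to reproduce; a minimal sketch:)

```python
raw = 'café'.encode('utf-8')   # 5 bytes: the 'é' takes two
clipped = raw[:4]              # byte-level truncation splits the 'é'

try:
    clipped.decode('utf-8')
except UnicodeDecodeError:
    print('truncated mid-character: no longer valid UTF-8')
```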
-CHB it would allow UTF-8 to be used just the way it usually is - as an > encoding that's almost 8-bit. > ouch! that perception is the route to way too many errors! it is by no means almost 8-bit, unless your data are almost ascii -- in which case, use latin-1 for pity's sake! This highlights my point though -- if we support UTF-8, people WILL use it, and only test it with mostly-ascii text, and not find the bugs that will crop up later. All this said, it seems to me that the important use cases for string > arrays involve interaction with existing binary formats, so people who have > to deal with such data should have the final say. (My own closest approach > to this is the FITS format, which is restricted by the standard to ASCII.) > yup -- not sure we'll get much guidance here though -- netcdf does not solve this problem well, either. But if you are pulling, say, a utf-8 encoded string out of a netcdf file -- it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than pasting the bytes straight to a numpy array and hoping that the encoding and truncation are correct. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Apr 20 13:36:42 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 20 Apr 2017 17:36:42 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included? 
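(For the record, a byte-budget truncation that never splits a multi-byte character is a short exercise in plain Python; this `truncate_utf8` helper is purely illustrative, not anything numpy ships:)

```python
def truncate_utf8(s, max_bytes):
    """Encode s to UTF-8, keeping only whole characters within max_bytes."""
    raw = s.encode('utf-8')
    if len(raw) <= max_bytes:
        return raw
    # Back up past any continuation bytes (0b10xxxxxx) so the cut
    # falls on a character boundary.
    cut = max_bytes
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut]

out = truncate_utf8('café', 4)
print(out, out.decode('utf-8'))   # b'caf' 'caf'
```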
On Thu, Apr 20, 2017 at 1:32 PM Chris Barker wrote: > [snip -- full quote of Chris's message above] -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:43:18 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:43:18 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a > fixed size per array element, but that doesn't mean we need a fixed size > per character. Each element in a UTF-8 array would be a string with a fixed > number of codepoints, not characters. > Ah, yes -- the nightmare of Unicode! No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units") and an unknown number of characters. As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. 
Which would be pretty pointless -- better to use UCS-4. So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct). But how much to over allocate? for english text, with an occasional scientific symbol, only a little. for, say, Japanese text, you'd need a factor 2 maybe? Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if: you need fixed-length storage you care about compactness In fact, we already have this sort of distinction between element size and > memory usage: np.string_ uses null padding to store shorter strings in a > larger dtype. > sure -- but it is clear to the user that the dtype can hold "up to this many" characters. > The only reason I see for supporting encodings other than UTF-8 is for > memory-mapping arrays stored with those encodings, but that seems like a > lot of extra trouble for little gain. > I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-) On the other hand, if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options. I think a 1-byte-per char latin-* encoded string is a good idea though -- scientific use tend to be latin only and space constrained. All that being said, if the truncation code were carefully written, it would mostly "just work" -CHB -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:46:31 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:46:31 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 10:36 AM, Neal Becker wrote: > I'm no unicode expert, but can't we truncate unicode strings so that only > valid characters are included? > sure -- it's just a bit fiddly -- and you need to make sure that everything gets passed through the proper mechanism. numpy is all about folks using other code to mess with the bytes in a numpy array. so we can't expect that all numpy string arrays will have been created with numpy code. Does python's string have a truncated encode option? i.e. you don't want to encode to utf-8 and then just chop it off. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Thu Apr 20 13:58:26 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Thu, 20 Apr 2017 17:58:26 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: > if you truncate a utf-8 bytestring, you may get invalid data Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF8. Also, is silent truncation a thing that we want to allow to happen anyway?
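(An aside on the truncated-encode question above: Python's str has no built-in truncating encode, but one can be sketched in a few lines. This is purely illustrative -- `truncate_utf8` is an invented helper, not a stdlib or numpy API:)

```python
def truncate_utf8(s, max_bytes):
    """Encode s as UTF-8 using at most max_bytes, never splitting a
    multi-byte sequence (so the result is always valid UTF-8)."""
    raw = s.encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    # Chop at the byte limit, then let the decoder drop the (at most
    # one) incomplete trailing sequence. Since raw came from a valid
    # encode, errors="ignore" can only discard that partial tail.
    return raw[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")

print(truncate_utf8("héllo", 3))  # b'h\xc3\xa9' -- keeps the whole 'é'
print(truncate_utf8("héllo", 2))  # b'h' -- rather than the invalid b'h\xc3'
```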
That sounds like something the user ought to be alerted to with an exception. > if you wanted to specify that a numpy element would be able to hold, say, N characters > ... > It simply is not the right way to handle text if [...] you need fixed-length storage It seems to me that counting code points is pretty futile in unicode, due to combining characters. The only two meaningful things to count are: * Graphemes, as that's what the user sees visually. These can span multiple code-points * Bytes of encoded data, as that's the space needed to store them So I would argue that the approach of fixed-codepoint-length storage is itself a flawed design, and so should not be used as a constraint on numpy. Counting graphemes is hard, so that leaves the only sensible option as a byte count. I don't foresee variable-length encodings being a problem implementation-wise - they only become one if numpy were to acquire a vectorized substring function that is intended to return a view. I think I'd be in favor of supporting all encodings, and falling back on python to handle encoding/decoding them. On Thu, 20 Apr 2017 at 18:44 Chris Barker wrote: > On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > >> I agree with Anne here. Variable-length encoding would be great to have, >> but even fixed length UTF-8 (in terms of memory usage, not characters) >> would solve NumPy's Python 3 string problem. NumPy's memory model needs a >> fixed size per array element, but that doesn't mean we need a fixed sized >> per character. Each element in a UTF-8 array would be a string with a fixed >> number of codepoints, not characters. >> > > Ah, yes -- the nightmare of Unicode! > > No, it would not be a fixed number of codepoints -- it would be a fixed > number of bytes (or "code units"). and an unknown number of characters.
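(The combining-character point is easy to demonstrate with the standard library -- 'é' is a single grapheme but can be one or two code points, and two or three UTF-8 bytes:)

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "e\u0301")  # 'é' as one code point
nfd = unicodedata.normalize("NFD", "\u00e9")   # 'e' + combining acute accent

print(len(nfc), len(nfd))  # 1 2 -- same grapheme, different code point counts
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 2 3
print(nfd[:1])  # 'e' -- slicing by code points silently drops the accent
```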
> > As Julian pointed out, if you wanted to specify that a numpy element would > be able to hold, say, N characters (actually code points, combining > characters make this even more confusing) then you would need to allocate > N*4 bytes to make sure you could hold any string that long. Which would be > pretty pointless -- better to use UCS-4. > > So Anne's suggestion that numpy truncates as needed would make sense -- > you'd specify say N characters, numpy would arbitrarily (or user specified) > over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a > string that didn't fit. Then you'd need to make sure you truncated > correctly, so as not to create an invalid string (that's just code, it > could be made correct). > > But how much to over allocate? for english text, with an occasional > scientific symbol, only a little. for, say, Japanese text, you'd need a > factor 2 maybe? > > Anyway, the idea that "just use utf-8" solves your problems is really > dangerous. It simply is not the right way to handle text if: > > you need fixed-length storage > you care about compactness > > In fact, we already have this sort of distinction between element size and >> memory usage: np.string_ uses null padding to store shorter strings in a >> larger dtype. >> > > sure -- but it is clear to the user that the dtype can hold "up to this > many" characters. > > >> The only reason I see for supporting encodings other than UTF-8 is for >> memory-mapping arrays stored with those encodings, but that seems like a >> lot of extra trouble for little gain. >> > > I see it the other way around -- the only reason TO support utf-8 is for > memory mapping with other systems that use it :-) > > On the other hand, if we ARE going to support utf-8 -- maybe use it for > all unicode support, rather than messing around with all the multiple > encoding options. 
> > I think a 1-byte-per char latin-* encoded string is a good idea though -- > scientific use tend to be latin only and space constrained. > > All that being said, if the truncation code were carefully written, it > would mostly "just work" > > -CHB > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Apr 20 14:15:49 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 20:15:49 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. On 20.04.2017 15:15, Julian Taylor wrote: > Hello, > As you probably know numpy does not deal well with strings in Python3. > The np.string type is actually zero terminated bytes and not a string. > In Python2 this happened to work out as it treats bytes and strings the > same way. 
But in Python3 this type is pretty hard to work with as each > time you get an item from a numpy bytes array it needs decoding to > receive a string. > The only string type available in Python3 is np.unicode which uses > 4-byte utf-32 encoding which is deemed to use too much memory to > actually see much use. > > What people apparently want is a string type for Python3 which uses less > memory for the common science use case which rarely needs more than > latin1 encoding. > As we have been told we cannot change the np.string type to actually be > strings as existing programs do interpret its content as bytes despite > this being very broken due to its null terminating property (it will > ignore all trailing nulls). > Also 8 years of working around numpy's poor python3 support decisions in > third parties probably make the 'return bytes' behaviour impossible to > change now. > > So we need a new dtype that can represent strings in numpy arrays which > is smaller than the existing 4 byte utf-32. > > To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datatime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > > Encodings we should support are: > - latin1 (1 bytes): > it is compatible with ascii and adds extra characters used in the > western world. > - utf-32 (4 bytes): > can represent every character, equivalent with np.unicode > > Encodings we should maybe support: > - utf-16 with explicitly disallowing surrogate pairs (2 bytes): > this covers a very large range of possible characters in a reasonably > compact representation > - utf-8 (4 bytes): > variable length encoding with minimum size of 1 bytes, but we would need > to assume the worst case of 4 bytes so it would not save anything > compared to utf-32 but may allow third parties replace an encoding step > with trailing null trimming on serialization. 
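(For reference, the 4-bytes-per-character cost of the existing unicode dtype, and the bytes-not-str behavior of the existing string dtype, are both visible directly with today's numpy -- this only demonstrates current behavior, not anything proposed:)

```python
import numpy as np

u = np.array(["spam", "egg"], dtype="U4")    # UTF-32: 4 bytes per character
s = np.array([b"spam", b"egg"], dtype="S4")  # null-padded bytes

print(u.dtype.itemsize)  # 16 -- 4 characters * 4 bytes
print(s.dtype.itemsize)  # 4
print(s[0])              # b'spam' -- bytes on Python 3, needs .decode()
```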
> > To actually do this we have two options, both of which break our ABI when > done without ugly hacks. > > - Add a new dtype, e.g. npy.realstring > By not modifying an existing type we only break programs using > NPY_CHAR. The most notable case of this is f2py. > It has the cosmetic disadvantage that it makes the np.unicode dtype > obsolete and is more busywork to implement. > > - Modify np.unicode to have encoding metadata > This allows us to reuse all the type boilerplate so it is more > convenient to implement and by extending an existing type instead of > making one obsolete it results in a much nicer API. > The big drawback is that it will explicitly break any third party that > receives an array with a new encoding and assumes that the buffer of an > array of type np.unicode will have a character itemsize of 4 bytes. > To ease this problem we would need to add APIs to get the itemsize and > encoding to numpy now so third parties can error out cleanly. > > The implementation of it is not that big a deal, I have already created > a prototype for adding latin1 metadata to np.unicode which works quite > well. It is imo realistic to get this into 1.14 should we be able to > make a decision on which way to implement it. > > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? > > cheers, > Julian > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From shoyer at gmail.com Thu Apr 20 14:16:34 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 11:16:34 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker wrote: > On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > >> I agree with Anne here.
Variable-length encoding would be great to have, >> but even fixed length UTF-8 (in terms of memory usage, not characters) >> would solve NumPy's Python 3 string problem. NumPy's memory model needs a >> fixed size per array element, but that doesn't mean we need a fixed sized >> per character. Each element in a UTF-8 array would be a string with a fixed >> number of codepoints, not characters. >> > > Ah, yes -- the nightmare of Unicode! > > No, it would not be a fixed number of codepoints -- it would be a fixed > number of bytes (or "code units"). and an unknown number of characters. > Apologies for confusing the terminology! Yes, this would mean a fixed number of bytes and an unknown number of characters. > As Julian pointed out, if you wanted to specify that a numpy element would > be able to hold, say, N characters (actually code points, combining > characters make this even more confusing) then you would need to allocate > N*4 bytes to make sure you could hold any string that long. Which would be > pretty pointless -- better to use UCS-4. > It's already unsafe to try to insert arbitrary length strings into a numpy string_ or unicode_ array. When determining the dtype automatically (e.g., with np.array(list_of_strings)), the difference is that numpy would need to check the maximum encoded length instead of the character length (i.e., len(x.encode()) instead of len(x)). I certainly would not over-allocate. If users want more space, they can explicitly choose an appropriate size. (This is a hazard of not having variable-length dtypes.) If users really want to be able to fit an arbitrary number of unicode characters and aren't concerned about memory usage, they can still use np.unicode_ -- that won't be going away. > So Anne's suggestion that numpy truncates as needed would make sense -- > you'd specify say N characters, numpy would arbitrarily (or user specified) > over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a > string that didn't fit.
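(The character-length versus encoded-length distinction, concretely:)

```python
s = "naïve"  # 5 characters
print(len(s))                      # 5
print(len(s.encode("utf-8")))      # 6 -- 'ï' needs two bytes
print(len(s.encode("latin-1")))    # 5
print(len(s.encode("utf-32-le")))  # 20 -- what np.unicode_ stores today
```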
Then you'd need to make sure you truncated > correctly, so as not to create an invalid string (that's just code, it > could be made correct). > NumPy already does this sort of silent truncation with longer strings inserted into shorter string dtypes. The difference here would indeed be the need to check the number of bytes represented by the string instead of the number of characters. But I don't think this is useful behavior to bring over to a new dtype. We should error instead of silently truncating. This is certainly easier than trying to figure out when we would be splitting a character. > But how much to over allocate? for english text, with an occasional > scientific symbol, only a little. for, say, Japanese text, you'd need a > factor 2 maybe? > > Anyway, the idea that "just use utf-8" solves your problems is really > dangerous. It simply is not the right way to handle text if: > > you need fixed-length storage > you care about compactness > > In fact, we already have this sort of distinction between element size and >> memory usage: np.string_ uses null padding to store shorter strings in a >> larger dtype. >> > > sure -- but it is clear to the user that the dtype can hold "up to this > many" characters. > As Yu Feng points out in this GitHub comment, non-latin language speakers are already aware of the difference between string length and bytes length: https://github.com/numpy/numpy/pull/8942#issuecomment-294409192 Making an API based on code units instead of code points really seems like the saner way to handle unicode strings. I agree with this section of the DyND design docs for its string type, which notes precedent from Julia and Go: https://github.com/libdynd/libdynd/blob/master/devdocs/string-design.md#code-unit-api-not-code-point I think a 1-byte-per char latin-* encoded string is a good idea though -- > scientific use tend to be latin only and space constrained.
I think scientific users tend to be ASCII only, so UTF-8 would also work transparently :). -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Apr 20 14:22:52 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 20 Apr 2017 12:22:52 -0600 Subject: [Numpy-discussion] Relaxed stride checking fixup In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 4:21 AM, Ralf Gommers wrote: > > > On Thu, Apr 20, 2017 at 6:28 AM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> Hi All, >> >> Currently numpy master has a bogus stride that will cause an error when >> downstream projects misuse it. That is done in order to help smoke out >> errors. Previously that bogus stride has been fixed up for releases, but >> that requires a special patch to be applied after each version branch is >> made. At this point I'd like to pick one or the other option and make the >> development and release branches the same in this regard. The question is: >> which option to choose? Keeping the fixup in master will remove some code >> and keep things simple, while not fixing up the release will possibly lead >> to more folks finding errors. At this point in time I am favoring applying >> the fixup in master. >> >> Thoughts? >> > > If we have to pick then keeping the fixup sounds reasonable. Would there > be value in making the behavior configurable at compile time? If there are > more such things and they'd be behind a __NUMPY_DEBUG__ switch, then people > may want to test that in their own CI. > Interesting thought. I wonder what else might be a good candidate for such a switch? Chuck -------------- next part -------------- An HTML attachment was scrubbed...
URL: From antoine at python.org Thu Apr 20 14:23:11 2017 From: antoine at python.org (Antoine Pitrou) Date: Thu, 20 Apr 2017 20:23:11 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, 20 Apr 2017 10:26:13 -0700 Stephan Hoyer wrote: > > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a > fixed size per array element, but that doesn't mean we need a fixed sized > per character. Each element in a UTF-8 array would be a string with a fixed > number of codepoints, not characters. > > In fact, we already have this sort of distinction between element size and > memory usage: np.string_ uses null padding to store shorter strings in a > larger dtype. > > The only reason I see for supporting encodings other than UTF-8 is for > memory-mapping arrays stored with those encodings, but that seems like a > lot of extra trouble for little gain. I think you want at least: ascii, utf8, ucs2 (aka utf16 without surrogates), utf32. That is, 3 common fixed width encodings and one variable width encoding. Regards Antoine. From robert.kern at gmail.com Thu Apr 20 14:53:53 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 11:53:53 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. 
We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases. FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things). So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions. For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string. I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R. 
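(The dtype=object workflow described above, for concreteness -- the example strings here are invented, and the manual encode at the end is just one way to get a serializable fixed-width array out of an object array:)

```python
import numpy as np

names = np.array(["Dvořák", "Saint-Saëns"], dtype=object)

# Each element is an ordinary Python str: full unicode semantics, no
# fixed length, at the cost of one heap object (and pointer) per item.
names[0] = names[0] + " (1841-1904)"
print(names[0])

# The array itself holds only pointers, so writing it to disk needs
# pickle -- or an explicit encode into a fixed-width bytes array:
encoded = np.array([x.encode("utf-8") for x in names])
print(encoded.dtype)  # a fixed-width 'S' dtype sized to the longest element
```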
If I had to jump ahead and propose new dtypes, I might suggest this: * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs. * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name. * Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied. * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Apr 20 15:05:11 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 12:05:11 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: > I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. 
HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions: https://github.com/PyTables/PyTables/issues/499 https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Thu Apr 20 14:59:44 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Thu, 20 Apr 2017 18:59:44 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor wrote: > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > This allows any codec (including truncated utf8) to be added easily (if > python supports it) and allows sidestepping the debate. > > My main concern is whether it should be a new dtype or modifying the > unicode dtype. Though the backward compatibility argument is strongly in > favour of adding a new dtype that makes the np.unicode type redundant. > Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively maintain/deprecate the old one.) Anne -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wieser.eric+numpy at gmail.com Thu Apr 20 15:15:33 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Thu, 20 Apr 2017 19:15:33 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary? Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code carelessly dropping it? Is this a problem in both C and python, or just C? If that's the case, can we end up with a compromise where being careless just causes old code to promote to ucs32? On Thu, 20 Apr 2017 at 20:09 Anne Archibald wrote: > On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > >> I probably have formulated my goal with the proposal a bit better, I am >> not very interested in a repetition of which encoding to use debate. >> In the end what will be done allows any encoding via a dtype with >> metadata like datetime. >> This allows any codec (including truncated utf8) to be added easily (if >> python supports it) and allows sidestepping the debate. >> >> My main concern is whether it should be a new dtype or modifying the >> unicode dtype. Though the backward compatibility argument is strongly in >> favour of adding a new dtype that makes the np.unicode type redundant. >> > > Creating a new dtype to handle encoded unicode, with the encoding > specified in the dtype, sounds perfectly reasonable to me. Changing the > behaviour of the existing unicode dtype seems like it's going to lead to > massive headaches unless exactly nobody uses it. The only downside to a new > type is having to find an obvious name that isn't already in use. (And > having to actively maintain/deprecate the old one.) 
> > Anne > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Thu Apr 20 15:17:39 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Thu, 20 Apr 2017 19:17:39 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explicitly? I feel like we ended up with the > current sub-optimal situation because we never really laid out the use > cases. We just felt like we needed bytestring and unicode dtypes, more out > of completionism than anything, and we made a bunch of assumptions just to > get each one done. I think there may be broad agreement that many of those > assumptions are "wrong", but it would be good to reference that against > concretely-stated use cases. > +1 > FWIW, if I need to work with in-memory arrays of strings in Python code, > I'm going to use dtype=object a la pandas. It has almost no arbitrary > constraints, and I can rely on Python's unicode facilities freely. There > may be some cases where it's a little less memory-efficient (e.g. > representing a column of enumerated single-character values like 'M'/'F'), > but that's never prevented me from doing anything (compare to the > uniform-length restrictions, which *have* prevented me from doing things). > > So what's left? Being able to memory-map to files that have string data > conveniently laid out according to numpy assumptions (e.g. FITS). 
Being > able to work with C/C++/Fortran APIs that have arrays of strings laid out > according to numpy assumptions (e.g. HDF5). I think it would behoove us to > canvass the needs of these formats and APIs before making any more > assumptions. > > For example, to my understanding, FITS files more or less follow numpy > assumptions for its string columns (i.e. uniform-length). But it enforces > 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the > singular motivating use case for the trailing-NULL behavior of np.string. > Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. [...] > If I had to jump ahead and propose new dtypes, I might suggest this: > > * For the most part, treat the string dtypes as temporary communication > formats rather than the preferred in-memory working format, similar to how > we use `float16` to communicate with GPU APIs. > > * Acknowledge the use cases of the current NULL-terminated np.string > dtype, but perhaps add a new canonical alias, document it as being for > those specific use cases, and deprecate/de-emphasize the current name. > > * Add a dtype for holding uniform-length `bytes` strings. This would be > similar to the current `void` dtype, but work more transparently with the > `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` > like `float64` does with `float`. This would not be NULL-terminated. No > encoding would be implied. > How would this differ from a numpy array of bytes with one more dimension? > * Maybe add a dtype similar to `object_` that only permits `unicode/str` > (2.x/3.x) strings (and maybe None to represent missing data a la pandas). > This maintains all of the flexibility of using a `dtype=object` array while > allowing code to specialize for working with strings without all kinds of > checking on every item. 
But most importantly, we can serialize such an > array to bytes without having to use pickle. Utility functions could be > written for en-/decoding to/from the uniform-length bytestring arrays > handling different encodings and things like NULL-termination (also working > with the legacy dtypes and handling structured arrays easily, etc.). > I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Apr 20 15:17:48 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 12:17:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. > > > HDF5 supports two character sets, ASCII and UTF-8. 
Both come in fixed and variable length versions: > https://github.com/PyTables/PyTables/issues/499 > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Apr 20 15:24:35 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 20 Apr 2017 13:24:35 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explicitly? I feel like we ended up with the > current sub-optimal situation because we never really laid out the use > cases. We just felt like we needed bytestring and unicode dtypes, more out > of completionism than anything, and we made a bunch of assumptions just to > get each one done. I think there may be broad agreement that many of those > assumptions are "wrong", but it would be good to reference that against > concretely-stated use cases. > > FWIW, if I need to work with in-memory arrays of strings in Python code, > I'm going to use dtype=object a la pandas. It has almost no arbitrary > constraints, and I can rely on Python's unicode facilities freely. There > may be some cases where it's a little less memory-efficient (e.g. 
> representing a column of enumerated single-character values like 'M'/'F'), > but that's never prevented me from doing anything (compare to the > uniform-length restrictions, which *have* prevented me from doing things). > > So what's left? Being able to memory-map to files that have string data > conveniently laid out according to numpy assumptions (e.g. FITS). Being > able to work with C/C++/Fortran APIs that have arrays of strings laid out > according to numpy assumptions (e.g. HDF5). I think it would behoove us to > canvass the needs of these formats and APIs before making any more > assumptions. > > For example, to my understanding, FITS files more or less follow numpy > assumptions for its string columns (i.e. uniform-length). But it enforces > 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the > singular motivating use case for the trailing-NULL behavior of np.string. > > I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. > > We should look at some of the newer formats and APIs, like Parquet and > Arrow, and also consider the cross-language APIs with Julia and R. > > If I had to jump ahead and propose new dtypes, I might suggest this: > > * For the most part, treat the string dtypes as temporary communication > formats rather than the preferred in-memory working format, similar to how > we use `float16` to communicate with GPU APIs. > > * Acknowledge the use cases of the current NULL-terminated np.string > dtype, but perhaps add a new canonical alias, document it as being for > those specific use cases, and deprecate/de-emphasize the current name. > > * Add a dtype for holding uniform-length `bytes` strings. 
This would be > similar to the current `void` dtype, but work more transparently with the > `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` > like `float64` does with `float`. This would not be NULL-terminated. No > encoding would be implied. > > * Maybe add a dtype similar to `object_` that only permits `unicode/str` > (2.x/3.x) strings (and maybe None to represent missing data a la pandas). > This maintains all of the flexibility of using a `dtype=object` array while > allowing code to specialize for working with strings without all kinds of > checking on every item. But most importantly, we can serialize such an > array to bytes without having to use pickle. Utility functions could be > written for en-/decoding to/from the uniform-length bytestring arrays > handling different encodings and things like NULL-termination (also working > with the legacy dtypes and handling structured arrays easily, etc.). > > A little history, IIRC, storing null terminated strings in fixed byte lengths was done in Fortran, strings were usually stored in integers/integer_arrays. If memory mapping of arbitrary types is not important, I'd settle for ascii or latin-1, utf-8 fixed byte length, and arrays of fixed python object type. Using one byte encodings and utf-8 avoids needing to deal with endianess. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Apr 20 15:27:17 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 21:27:17 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: <659e2b27-b952-4db7-e9b1-9364681f8aa8@googlemail.com> On 20.04.2017 20:53, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > > wrote: > >> Do you have comments on how to go forward, in particular in regards to >> new dtype vs modify np.unicode? 
> > Can we restate the use cases explicitly? I feel like we ended up with > the current sub-optimal situation because we never really laid out the > use cases. We just felt like we needed bytestring and unicode dtypes, > more out of completionism than anything, and we made a bunch of > assumptions just to get each one done. I think there may be broad > agreement that many of those assumptions are "wrong", but it would be > good to reference that against concretely-stated use cases. We ended up in this situation because we did not take the opportunity to break compatibility when python3 support was added. We should have made the string dtype an encoded byte type (ascii or latin1) in python3 instead of null terminated unencoded bytes which do not make very much practical sense. So the use case is very simple: Give users of the string dtype a migration path that does not involve converting to full utf32 unicode. The latin1 encoded bytes dtype would allow that. As we already have the infrastructure this same dtype can allow more than just latin1 with minimal effort, for the fixed size python supported stuff it is literally adding an enum entry, two new switch clauses and a little bit of dtype string parsing and testcases. Having some form of variable string handling would be nice. But this is another topic all together. Having builtin support for variable strings only seems overkill as the string dtype is not that important and object arrays should work reasonably well for this usecase already. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From rainwoodman at gmail.com Thu Apr 20 15:34:24 2017 From: rainwoodman at gmail.com (Feng Yu) Date: Thu, 20 Apr 2017 12:34:24 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: I suggest a new data type 'text[encoding]', 'T'. 1. 
text can be cast to python strings via decoding. 2. Conceptually casting to python bytes first cast to a string then calls encode(); the current encoding in the meta data is used by default, but the new encoding can be overridden. I slightly favour 'T16' as a fixed size, text record backed by 16 bytes. This way over-allocation is forcefully delegated to the user, simplifying numpy array. Yu On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> >> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern >> wrote: >>> >>> I don't know of a format off-hand that works with numpy uniform-length >>> strings and Unicode as well. HDF5 (to my recollection) supports arrays of >>> NULL-terminated, uniform-length ASCII like FITS, but only variable-length >>> UTF8 strings. >> >> >> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and >> variable length versions: >> https://github.com/PyTables/PyTables/issues/499 >> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html >> >> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for >> storage, not the number of characters. > > Ah, okay, I was interpolating from a quick perusal of the h5py docs, which > of course are also constrained by numpy's current set of dtypes. The > NULL-terminated ASCII works well enough with np.string's semantics. 
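[Editorial note] The point that HDF5's "fixed length UTF-8" counts bytes rather than characters is easy to demonstrate — the same character count can require different byte counts:

```python
# Two 4-character strings with different UTF-8 byte lengths.
s_ascii = "cafe"
s_accent = "café"

assert len(s_ascii) == len(s_accent) == 4        # 4 code points each
assert len(s_ascii.encode("utf-8")) == 4         # 4 bytes
assert len(s_accent.encode("utf-8")) == 5        # é takes 2 bytes

# So a fixed 5-byte UTF-8 field holds "café" but not "caffé",
# even though the latter is only 5 characters long:
assert len("caffé".encode("utf-8")) == 6
```

This is why a byte-counted UTF-8 dtype cannot promise a fixed number of characters, only a fixed storage size.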
> > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From jtaylor.debian at googlemail.com Thu Apr 20 15:40:12 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 21:40:12 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On 20.04.2017 20:59, Anne Archibald wrote: > On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor > > > wrote: > > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > This allows any codec (including truncated utf8) to be added easily (if > python supports it) and allows sidestepping the debate. > > My main concern is whether it should be a new dtype or modifying the > unicode dtype. Though the backward compatibility argument is strongly in > favour of adding a new dtype that makes the np.unicode type redundant. > > > Creating a new dtype to handle encoded unicode, with the encoding > specified in the dtype, sounds perfectly reasonable to me. Changing the > behaviour of the existing unicode dtype seems like it's going to lead to > massive headaches unless exactly nobody uses it. The only downside to a > new type is having to find an obvious name that isn't already in use. > (And having to actively maintain/deprecate the old one.) > > Anne > We wouldn't really be changing the behaviour of the unicode dtype. Only programs accessing the databuffer directly and trying to decode would need to be changed. I assume this can happen for programs that do serialization + reencoding of numpy string arrays at the C level (at the python level you would be fine). 
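[Editorial note] What the "programs accessing the databuffer directly" would see today: the current 'U' dtype stores one 4-byte UCS-4/UTF-32 code unit per character in native byte order, which C-level consumers may rely on when decoding:

```python
import sys
import numpy as np

# The raw buffer of a 'U' array is UCS-4 in native byte order.
a = np.array(["ab"], dtype="U2")
raw = a.tobytes()
assert len(raw) == 8  # 2 characters * 4 bytes each

codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
assert raw.decode(codec) == "ab"

# A consumer decoding this buffer as UTF-32 would break if the same
# dtype could suddenly carry, say, latin-1 or UTF-8 payloads -- the
# compatibility risk being weighed in this thread.
```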
These programs would be broken, but only when they actually receive a string array that does not have the default utf32 encoding. I really don't like that a fully new dtype means creating more junk and extra code paths to numpy. But it is probably do big of a compatibility break to accept to keep our code clean. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From robert.kern at gmail.com Thu Apr 20 15:46:21 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 12:46:21 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald wrote: > > On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: >> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string. > > Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. Never mind, then. :-) >> If I had to jump ahead and propose new dtypes, I might suggest this: >> >> * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs. >> >> * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name. >> >> * Add a dtype for holding uniform-length `bytes` strings. 
This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied. > > How would this differ from a numpy array of bytes with one more dimension? The scalar in the implementation being the scalar in the use case, immutability of the scalar, directly working with b'' strings in and out (and thus work with the Python codecs easily). >> * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.). > > I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest. 
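[Editorial note] Anne's "truncating to a valid encoding" behavior can be sketched in a few lines: cut a UTF-8 byte string to at most `limit` bytes without splitting a multi-byte character. `utf8_truncate` is a hypothetical helper, not an existing NumPy function:

```python
def utf8_truncate(s: str, limit: int) -> bytes:
    """Longest valid UTF-8 prefix of s that fits in `limit` bytes."""
    cut = s.encode("utf-8")[:limit]
    # Decoding with errors="ignore" drops a trailing incomplete
    # sequence; re-encoding yields the longest valid prefix.
    return cut.decode("utf-8", errors="ignore").encode("utf-8")

assert utf8_truncate("hello", 3) == b"hel"
assert utf8_truncate("café", 5) == b"caf\xc3\xa9"   # fits exactly
assert utf8_truncate("café", 4) == b"caf"           # 2-byte é not split
```

A dtype doing this automatically would trade a few wasted tail bytes for the guarantee that the stored field is always decodable; the raise-an-exception alternative simply replaces the truncation with a `ValueError`.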
HDF5 seems to support this, but only for ASCII and UTF8, not a large list of encodings. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Apr 20 15:51:57 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 12:51:57 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern > wrote: > >> > >> I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. > > > > > > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed > and variable length versions: > > https://github.com/PyTables/PyTables/issues/499 > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > > > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for > storage, not the number of characters. > > Ah, okay, I was interpolating from a quick perusal of the h5py docs, which > of course are also constrained by numpy's current set of dtypes. The > NULL-terminated ASCII works well enough with np.string's semantics. > Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert.kern at gmail.com Thu Apr 20 16:00:48 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 13:00:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <659e2b27-b952-4db7-e9b1-9364681f8aa8@googlemail.com> References: <659e2b27-b952-4db7-e9b1-9364681f8aa8@googlemail.com> Message-ID: On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > On 20.04.2017 20:53, Robert Kern wrote: > > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > > > > wrote: > > > >> Do you have comments on how to go forward, in particular in regards to > >> new dtype vs modify np.unicode? > > > > Can we restate the use cases explicitly? I feel like we ended up with > > the current sub-optimal situation because we never really laid out the > > use cases. We just felt like we needed bytestring and unicode dtypes, > > more out of completionism than anything, and we made a bunch of > > assumptions just to get each one done. I think there may be broad > > agreement that many of those assumptions are "wrong", but it would be > > good to reference that against concretely-stated use cases. > > We ended up in this situation because we did not take the opportunity to > break compatibility when python3 support was added. Oh, the root cause I'm thinking of long predates Python 3, or even numpy 1.0. There never was an explicitly fleshed out use case for unicode arrays other than "Python has unicode strings, so we should have a string dtype that supports it". Hence the "we only support UCS4" implementation; it's not like anyone *wants* UCS4 or interoperates with UCS4, but it does represent all possible Unicode strings. The Python 3 transition merely exacerbated the problem by making Unicode strings the primary string type to work with. I don't really want to ameliorate the exacerbation without addressing the root problem, which is worth solving. 
I will put this down as a marker use case: Support HDF5's fixed-width UTF-8 arrays. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Thu Apr 20 16:01:25 2017 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Thu, 20 Apr 2017 16:01:25 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: > I suggest a new data type 'text[encoding]', 'T'. I like the suggestion very much (it is even in between S and U!). The utf-8 manifesto linked to above convinced me that the number that should follow is the number of bytes, which is nicely consistent with use in all numerical dtypes. Any way, more specifically on Julian's question: it seems to me one has little choice but to make a new dtype (and OK if that makes unicode obsolete). I think what exact encodings to support is a separate question. -- Marten From robert.kern at gmail.com Thu Apr 20 16:04:33 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 13:04:33 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: >> >> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> > >> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. >> > >> > >> > HDF5 supports two character sets, ASCII and UTF-8. 
Both come in fixed and variable length versions: >> > https://github.com/PyTables/PyTables/issues/499 >> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html >> > >> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. >> >> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. > > Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes). "... well enough with np.string's semantics [that h5py actually used it to pass data in and out; whether that array is fit for purpose beyond that, I won't comment]." :-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From hodge at stsci.edu Thu Apr 20 16:16:40 2017 From: hodge at stsci.edu (Phil Hodge) Date: Thu, 20 Apr 2017 16:16:40 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On 04/20/2017 03:17 PM, Anne Archibald wrote: > Actually if I understood the spec, FITS header lines are 80 bytes long > and contain ASCII with no NULLs; strings are quoted and trailing > spaces are stripped. > FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written. 
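[Editorial note] The current 'S' (np.string_) dtype already matches the FITS BINTABLE convention Phil describes: short values are NULL-padded in the fixed field, a value that exactly fills the field carries no NULL, and trailing NULLs are stripped on the way back out:

```python
import numpy as np

# NULL padding on write; no NULL when the value fills the field.
a = np.array([b"abc", b"abcde"], dtype="S5")
assert a.tobytes() == b"abc\x00\x00abcde"

# Trailing NULLs are stripped from each item on read:
b = np.frombuffer(b"abc\x00\x00abcde", dtype="S5")
assert b[0] == b"abc"
assert b[1] == b"abcde"
```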
Phil From robert.kern at gmail.com Thu Apr 20 18:20:32 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 15:20:32 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge wrote: > > On 04/20/2017 03:17 PM, Anne Archibald wrote: >> >> Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. > > FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written. Ah, that's what I was thinking of, thank you. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Apr 21 13:11:44 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 21 Apr 2017 11:11:44 -0600 Subject: [Numpy-discussion] __array_ufunc__ final review Message-ID: Hi All, The __array_ufunc__ PR is ready for final review. If there are no complaints, I plan to put it in tomorrow. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Apr 21 14:34:26 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 21 Apr 2017 11:34:26 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <8741041756854148453@unknownmsgid> References: <8741041756854148453@unknownmsgid> Message-ID: I just re-read the "Utf-8" manifesto, and it helped me clarify my thoughts: 1) most of it is focused on utf-8 vs utf-16. And that is a strong argument -- utf-16 is the worst of both worlds. 2) it isn't really addressing how to deal with fixed-size string storage as needed by numpy. 
It does bring up Python's current approach to Unicode: """ This lead to software design decisions such as Python?s string O(1) code point access. The truth, however, is that Unicode is inherently more complicated and there is no universal definition of such thing as *Unicode character*. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language for that. """ My thoughts on that-- it's technically correct, but practicality beats purity, and the character concept is pretty darn useful for at least some (commonly used in the computing world) languages. In any case, whether the top-level API is character focused doesn't really have a bearing on the internal encoding, which is very much an implementation detail in py 3 at least. And Python has made its decision about that. So what are the numpy use-cases? I see essentially two: 1) Use with/from Python -- both creating and working with numpy arrays. In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...). However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full unicode support with fixed number of bytes matching fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer life time does it take to double your memory? But now, when memory, disk space, bandwidth, etc, are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space? 
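[Editorial note] The "factor of 4" trade-off in concrete terms: the current 'U' dtype always spends 4 bytes per character, while a one-byte-per-char dtype (the existing 'S', or the latin-1 variant proposed here) spends one:

```python
import numpy as np

# Same logical content, 4x the storage for the UCS-4 version.
ucs4 = np.array(["numpy"] * 1000, dtype="U5")
onebyte = np.array([b"numpy"] * 1000, dtype="S5")

assert np.dtype("U5").itemsize == 20   # 5 chars * 4 bytes each
assert np.dtype("S5").itemsize == 5
assert ucs4.nbytes == 4 * onebyte.nbytes
```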
Alternatively, Robert's suggestion of having essentially an object array, where the objects were known to be python strings is a pretty nice idea -- it gives the full power of python strings, and is a perfect one-to-one match with the python text data model. But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that. Yes, it would have to be a "new" dtype for backwards compatibility. 2) Interchange with other systems: passing the raw binary data back and forth between numpy arrays and other code, written in C, Fortran, or binary flle formats. This is a key use-case for numpy -- I think the key to its enormous success. But how important is it for text? Certainly any data set I've ever worked with has had gobs of binary numerical data, and a small smattering of text. So in that case, if, for instance, h5py had to encode/decode text when transferring between HDF files and numpy arrays, I don't think I'd ever see the performance hit. As for code complexity -- it would mean more complex code in interface libs, and less complex code in numpy itself. (though numpy could provide utilities to make it easy to write the interface code) If we do want to support direct binary interchange with other libs, then we should probably simply go for it, and support any encoding that Python supports -- as long as you are dealing with multiple encodings, why try to decide up front which ones to support? But how do we expose this to numpy users? I still don't like having non-fixed-width encoding under the hood, but what can you do? 
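[Editorial note] "Support any encoding that Python supports" need not mean new dtypes — it can also be done with conversion utilities that pack str data into fixed-width 'S' fields using a chosen codec. `encode_fixed` and `decode_fixed` below are hypothetical helpers sketching that interface-library approach, not NumPy API:

```python
import numpy as np

def encode_fixed(strings, width, encoding):
    """Pack Python strings into a fixed-width bytes array via a codec."""
    out = np.zeros(len(strings), dtype=f"S{width}")
    for i, s in enumerate(strings):
        data = s.encode(encoding)
        if len(data) > width:
            # Raise rather than silently truncate (see the note on
            # ValueError vs truncation elsewhere in this thread).
            raise ValueError(f"{s!r} needs {len(data)} bytes, field is {width}")
        out[i] = data
    return out

def decode_fixed(arr, encoding):
    """Unpack a fixed-width bytes array back into Python strings."""
    return np.array([item.decode(encoding) for item in arr], dtype=object)

packed = encode_fixed(["abc", "dé"], width=4, encoding="utf-8")
assert packed.dtype == np.dtype("S4")
roundtrip = decode_fixed(packed, "utf-8")
assert list(roundtrip) == ["abc", "dé"]
```

The encoding lives in the caller's hands rather than in the dtype, which is the "more complex code in interface libs, less in numpy" trade-off described above.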
Other than that, having the encoding be a selectable part of the dtype works fine -- and in that case the number of bytes should be the "length" specifier. This, however, creates a bit of an impedance mismatch between the "character-focused" approach of the python string type. And requires the user to understand something about the encoding in order to even know how many bytes they need -- a utf-8-100 string will hold a different "length" of string than a utf-16-100 string. So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two. For Python use -- a pointer to a Python string would be nice. Then use a native flexible-encoding dtype for everything else. Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else. One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error: EncodingError if it can't be encoded into the defined encoding. ValueError if it is too long -- it should not be silently truncated. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shoyer at gmail.com Fri Apr 21 17:34:27 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 21 Apr 2017 14:34:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker wrote:

> 1) Use with/from Python -- both creating and working with numpy arrays.
>
> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).

Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. We already have strong precedent for dtypes reflecting the number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.

> However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full unicode support with fixed number of bytes matching fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer lifetime does it take to double your memory? But now, when memory, disk space, bandwidth, etc, are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space?

Storage cost is always going to be a concern.
Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.

> But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that.

I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

> So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two.
>
> For Python use -- a pointer to a Python string would be nice.

Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.

> Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be enough.

> Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else.

The np.unicode_ type is already UCS-4 and the default for dtype=str on Python 3.
We probably shouldn't change that, but if we set any default encoding for the new text type, I strongly believe it should be utf-8.

> One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
>
> EncodingError if it can't be encoded into the defined encoding.
>
> ValueError if it is too long -- it should not be silently truncated.

I think we all agree here.

-------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Mon Apr 24 13:04:53 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 10:04:53 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote:

>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).
>
> Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.

Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will be a cognitive dissonance if someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests. So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit.
If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encoded bytes naively, you could end up with an invalid bytestring, but you don't know how many characters to truncate, either.

> We already have strong precedent for dtypes reflecting number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.

sure, but a float64 is 64 bits forever and always, and the defaults perfectly match what python is doing under its hood -- even if users don't think about it. So the default behaviour of numpy matches python's built-in types.

> Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.

sure -- but again, what is the use-case for numpy arrays with a s#$)load of text in them? common? I don't think so. And as you pointed out numpy doesn't do text processing anyway, so cache performance and all that are not important. So having UCS-4 as the default, but allowing folks to select a more compact format if they really need it, is a good way to go. Just like numpy generally defaults to float64 and Int64 (or 32, depending on platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.

> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

utf-8 is NOT a one-byte per char encoding.
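[A quick sketch of both points -- the character/byte mismatch, and one safe way to truncate. The helper name here is illustrative, not an existing numpy or stdlib API:]

```python
def utf8_truncate(s: str, max_bytes: int) -> str:
    """Truncate s so its UTF-8 encoding fits in max_bytes,
    without splitting a multi-byte character at the cut point."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # Decoding the byte-truncated prefix with errors="ignore"
    # silently drops any partial character left at the end.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

s = "résumé"
assert len(s) == 6                   # six characters...
assert len(s.encode("utf-8")) == 8   # ...but eight UTF-8 bytes

# Fitting into a 5-byte field keeps whole characters only:
assert utf8_truncate(s, 5) == "résu"
```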
IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

>> For Python use -- a pointer to a Python string would be nice.
>
> Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.

hmm -- that's a nifty idea -- though I think strings could/should be special cased.

>> Then use a native flexible-encoding dtype for everything else.
>
> No opposition here from me. Though again, I think utf-8 alone would also be enough.

maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.

>> One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
>>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>
> I think we all agree here.

I'm actually having second thoughts -- see above -- if the encoding is utf-8, then truncating is non-trivial -- maybe it would be better for numpy to do it for you. Or set a flag as to which you want?
The current 'S' dtype truncates silently already:

In [6]: arr
Out[6]: array(['this', 'that'], dtype='|S4')

In [7]: arr[0] = "a longer string"

In [8]: arr
Out[8]: array(['a lo', 'that'], dtype='|S4')

(similarly for the unicode type)

So at least we are used to that.

BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and then get 30 or 31 characters -- no big deal. But what if you have a simple label or something with one or two characters: Then you have 2 bytes to store the name in, and someone tries to put an "odd" character in there, and you get an empty string. Not good.

Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? stick with the exact number of bytes?

It all comes down to this: Python3 has made a very deliberate (and I think Good) choice to treat text as a string of characters, where the user does not need to know or care about encoding issues. Numpy's defaults should do the same thing.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov

-------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Apr 24 13:51:55 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 10:51:55 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote:

> latin-1 or latin-9 buys you (over ASCII):
>
> ...
> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes in it (see Python 2 vs Python 3 strings). Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (i.e., netCDF3) that support unspecified "one byte strings." Then you're a few short calls away from viewing (i.e., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding (i.e., np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the proper encoding. It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.

On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.

>>> Then use a native flexible-encoding dtype for everything else.
>>
>> No opposition here from me. Though again, I think utf-8 alone would also be enough.
>
> maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.

Indeed, it would be helpful for this discussion to know what other encodings are actually currently used by scientific applications. So far, we have real use cases for at least UTF-8, UTF-32, ASCII and "unknown".
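[The latin-1 round-trip property and the "few short calls" decode step can both be checked directly today -- a sketch using the existing 'S' dtype and np.char.decode (which does element-wise bytes.decode); the byte values and field width are made up:]

```python
import numpy as np

# The latin-1 round-trip property: every byte value 0-255 decodes,
# and re-encoding recovers the original bytes exactly.
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data

# Bytes as loaded from a legacy format with an unspecified one-byte
# encoding; once the real encoding is known, decode element-wise:
raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S5")
decoded = np.char.decode(raw, "latin-1")
assert decoded[0] == "café"
assert decoded.dtype.kind == "U"   # now a fixed-width unicode array
```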
The current 'S' dtype truncates silently already: > One advantage of a new (non-default) dtype is that we can change this behavior. > Also -- if utf-8 is the default -- what do you get when you create an > array from a python string sequence? Currently with the 'S' and 'U' dtypes, > the dtype is set to the longest string passed in. Are we going to pad it a > bit? stick with the exact number of bytes? > It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text. We can keep bytes/str mapped to the current choices. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 13:51:55 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 13:51:55 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > > >> In this case, we want something compatible with Python's string (i.e. >>> full Unicode supporting) and I think should be as transparent as possible. >>> Python's string has made the decision to present a character oriented API >>> to users (despite what the manifesto says...). >>> >> >> Yes, but NumPy doesn't really implement string operations, so fortunately >> this is pretty irrelevant to us -- except for our API for specifying dtype >> size. >> > > Exactly -- the character-orientation of python strings means that people > are used to thinking that strings have a length that is the number of > characters in the string. I think there will a cognitive dissonance if > someone does: > > arr[i] = a_string > > Which then raises a ValueError, something like: > > String too long for a string[12] dytype array. 
> > When len(a_string) <= 12 > > AND that will only occur if there are non-ascii characters in the string, > and maybe only if there are more than N non-ascii characters. i.e. it is > very likely to be a run-time error that may not have shown up in tests. > > So folks need to do something like: > > len(a_string.encode('utf-8')) to see if their string will fit. If not, > they need to truncate it, and THAT is non-obvious how to do, too -- you > don't want to truncate the encodes bytes naively, you could end up with an > invalid bytestring. but you don't know how many characters to truncate, > either. > > >> We already have strong precedence for dtypes reflecting number of bytes >> used for storage even when Python doesn't: consider numeric types like >> int64 and float32 compared to the Python equivalents. It's an intrinsic >> aspect of NumPy that users need to think about how their data is actually >> stored. >> > > sure, but a float64 is 64 bytes forever an always and the defaults > perfectly match what python is doing under its hood --even if users don't > think about. So the default behaviour of numpy matched python's built-in > types. > > > Storage cost is always going to be a concern. Arguably, it's even more of >>> a concern today than it used to be be, because compute has been improving >>> faster than storage. >>> >> > sure -- but again, what is the use-case for numpy arrays with a s#$)load > of text in them? common? I don't think so. And as you pointed out numpy > doesn't do text processing anyway, so cache performance and all that are > not important. So having UCS-4 as the default, but allowing folks to select > a more compact format if they really need it is a good way to go. Just like > numpy generally defaults to float64 and Int64 (or 32, depending on > platform) -- users can select a smaller size if they have a reason to. 
> I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.
>
>> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.

>>> For Python use -- a pointer to a Python string would be nice.
>>
>> Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.
>> > > hmm -- that's nifty idea -- though I think strings could/should be special > cased. > > >> Then use a native flexible-encoding dtype for everything else. >>> >> >> No opposition here from me. Though again, I think utf-8 alone would also >> be enough. >> > > maybe so -- the major reason for supporting others is binary data exchange > with other libraries -- but maybe most of them have gone to utf-8 anyway. > > One more note: if a user tries to assign a value to a numpy string array >>> that doesn't fit, they should get an error: >>> >> >>> EncodingError if it can't be encoded into the defined encoding. >>> >>> ValueError if it is too long -- it should not be silently truncated. >>> >> >> I think we all agree here. >> > > I'm actually having second thoughts -- see above -- if the encoding is > utf-8, then truncating is non-trivial -- maybe it would be better for numpy > to do it for you. Or set a flag as to which you want? > > The current 'S' dtype truncates silently already: > > In [6]: arr > > Out[6]: > array(['this', 'that'], > dtype='|S4') > > In [7]: arr[0] = "a longer string" > > In [8]: arr > > Out[8]: > array(['a lo', 'that'], > dtype='|S4') > > (similarly for the unicode type) > > So at least we are used to that. > > BTW -- maybe we should keep the pathological use-case in mind: really > short strings. I think we are all thinking in terms of longer strings, > maybe a name field, where you might assign 32 bytes or so -- then someone > has an accented character in their name, and then ge30 or 31 characters -- > no big deal. > I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings. I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world. BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. 
Hopefully the *fourth* time will be the charm! https://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html https://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html https://mail.scipy.org/pipermail/numpy-discussion/2015-February/072311.html - Tom > > > But what if you have a simple label or something with 1 or two characters: > Then you have 2 bytes to store the name in, and someone tries to put an > "odd" character in there, and you get an empty string. not good. > > Also -- if utf-8 is the default -- what do you get when you create an > array from a python string sequence? Currently with the 'S' and 'U' dtypes, > the dtype is set to the longest string passed in. Are we going to pad it a > bit? stick with the exact number of bytes? > > It all comes down to this: > > Python3 has made a very deliberate (and I think Good) choice to treat text > as a string of characters, where the user does not need to know or care > about encoding issues. Numpy's defaults should do the same thing. > > -CHB > > > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Mon Apr 24 14:13:51 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 11:13:51 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer wrote: > - round-tripping of binary data (at least with Python's encoding/decoding) >> -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the >> same bytes back. You may get garbage, but you won't get an EncodingError. >> > > For a new application, it's a good thing if a text type breaks when you to > stuff arbitrary bytes in it > maybe, maybe not -- the application may be new, but the data it works with may not be. > (see Python 2 vs Python 3 strings). > this is exactly why py3 strings needed to add the "surrogateescape" error handler: https://www.python.org/dev/peps/pep-0383 sometimes text and binary data are mixed, sometimes encoded text is broken. It is very useful to be able to pass such data through strings losslessly. Certainly, I would argue that nobody should write data in latin-1 unless > they're doing so for the sake of a legacy application. > or you really want that 1-byte per char efficiency > I do understand the value in having some "string" data type that could be > used by default by loaders for legacy file formats/applications (i.e., > netCDF3) that support unspecified "one byte strings." Then you're a few > short calls away from viewing (i.e., array.view('text[my_real_encoding]'), > if we support arbitrary encodings) or decoding (i.e., > np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the > proper encoding. It's not realistic to expect users to know the true > encoding for strings from a file before they even look at the data. 
> except that you really should :-( On the other hand, if this is the use-case, perhaps we really want an > encoding closer to "Python 2" string, i.e, "unknown", to let this be > signaled more explicitly. I would suggest that "text[unknown]" should > support operations like a string if it can be decoded as ASCII, and > otherwise error. But unlike "text[ascii]", it will let you store arbitrary > bytes. > I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through. So far, we have real use cases for at least UTF-8, UTF-32, ASCII and > "unknown". > hmm -- "unknown" should be bytes, not text. If the user needs to look at it first, then load it as bytes, run chardet or something on it, then cast to the right encoding. The current 'S' dtype truncates silently already: >> > > One advantage of a new (non-default) dtype is that we can change this > behavior. > yeah -- still on the edge about that, at least with variable-size encodings. It's hard to know when it's going to happen and it's hard to know what to do when it does. At least if if truncates silently, numpy can have the code to do the truncation properly. Maybe an option? And the numpy numeric types truncate (Or overflow) already. Again: If the default string handling matches expectations from python strings, then the specialized ones can be more buyer-beware. Also -- if utf-8 is the default -- what do you get when you create an array >> from a python string sequence? Currently with the 'S' and 'U' dtypes, the >> dtype is set to the longest string passed in. Are we going to pad it a bit? >> stick with the exact number of bytes? >> > > It might be better to avoid this for now, and force users to be explicit > about encoding if they use the dtype for encoded text. > yup. 
And we really should have a bytes type for py3 -- which we do, it's just called 'S', which is pretty confusing :-) -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Mon Apr 24 14:21:48 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 11:21:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > BTW -- maybe we should keep the pathological use-case in mind: really >> short strings. I think we are all thinking in terms of longer strings, >> maybe a name field, where you might assign 32 bytes or so -- then someone >> has an accented character in their name, and then ge30 or 31 characters -- >> no big deal. >> > > I wouldn't call it a pathological use case, it doesn't seem so uncommon to > have large datasets of short strings. > It's pathological for using a variable-length encoding. > I personally deal with a database of hundreds of billions of 2 to 5 > character ASCII strings. This has been a significant blocker to Python 3 > adoption in my world. > I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings... BTW, for those new to the list or with a short memory, this topic has been > discussed fairly extensively at least 3 times before. Hopefully the > *fourth* time will be the charm! > yes, let's hope so! The big difference now is that Julian seems to be committed to actually making it happen! Thanks Julian! 
Which brings up a good point -- if you need us to stop the damn bike-shedding so you can get it done -- say so. I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing.

-Chris

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov

-------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 14:36:15 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 11:36:15 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker wrote:
>
> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote:
>>>
>>> BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and then get 30 or 31 characters -- no big deal.
>>
>> I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings.
>
> It's pathological for using a variable-length encoding.
>
>> I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world.
>
> I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings...

Unless you have hundreds of billions of them.
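[The arithmetic behind that objection, as a back-of-envelope sketch with illustrative numbers -- 10^11 strings padded to 5 characters:]

```python
n = 100 * 10**9      # "hundreds of billions" of strings
width = 5            # 2-5 character codes, padded to 5

one_byte_per_char = n * width    # latin-1/ASCII: 1 byte per char
ucs4 = n * width * 4             # a 'U5' dtype: 4 bytes per char

assert one_byte_per_char == 500 * 10**9   # ~500 GB
assert ucs4 == 2 * 10**12                 # ~2 TB
```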
>> BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. Hopefully the *fourth* time will be the charm! > > yes, let's hope so! > > The big difference now is that Julian seems to be committed to actually making it happen! > > Thanks Julian! > > Which brings up a good point -- if you need us to stop the damn bike-shedding so you can get it done -- say so. > > I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing. FWIW, I prefer nothing to just adding a special case for latin-1. Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 14:47:21 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 11:47:21 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. > > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. 
That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 14:56:55 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 14:56:55 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < > aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker > wrote: > > >> - round-tripping of binary data (at least with Python's > encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and > re-encoded to get the same bytes back. You may get garbage, but you won't > get an EncodingError. > > > > +1. The key point is that there is a HUGE amount of legacy science data > in the form of FITS (astronomy-specific binary file format that has been > the primary file format for 20+ years) and HDF5 which uses a character data > type to store data which can be bytes 0-255. Getting an decoding/encoding > error when trying to deal with these datasets is a non-starter from my > perspective. > > That says to me that these are properly represented by `bytes` objects, > not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 > encoding. > If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters. 
So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution. - Tom > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 15:09:06 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 12:09:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > >>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...). >> >> >> Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. > > Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will a cognitive dissonance if someone does: > > arr[i] = a_string > > Which then raises a ValueError, something like: > > String too long for a string[12] dytype array. We have the freedom to make the error message not suck. :-) > When len(a_string) <= 12 > > AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests. 
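
For reference, today's fixed-width dtypes don't raise at all on overlong assignment -- they silently truncate, which is the behavior the hypothetical ValueError above would replace; a minimal sketch with current numpy:

```python
import numpy as np

arr = np.array(['this', 'that'])   # dtype becomes a 4-character 'U4'
arr[0] = 'thistle'                 # silently truncated, no error raised
print(arr[0])                      # 'this'
```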
> > So folks need to do something like: > > len(a_string.encode('utf-8')) to see if their string will fit. If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encodes bytes naively, you could end up with an invalid bytestring. but you don't know how many characters to truncate, either. If this becomes the right strategy for dealing with these problems (and I'm not sure that it is), we can easily make a utility function that does this for people. This discussion is why I want to be sure that we have our use cases actually mapped out. For this kind of in-memory manipulation, I'd use an object array (a la pandas), then convert to the uniform-width string dtype when I needed to push this out to a C API, HDF5 file, or whatever actually requires a string-dtype array. The required width gets computed from the data after all of the manipulations are done. Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1! >> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. > > utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context. > > latin-1 or latin-9 buys you (over ASCII): > > - A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better. > > - A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...) 
> > - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 16:06:06 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 13:06:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> >> >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. >> > >> > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. >> >> That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. 
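
The truncation problem Chris raises above -- cutting encoded UTF-8 bytes naively can leave an invalid partial character -- can be sketched with a small helper (`utf8_truncate` is a hypothetical illustration, not a numpy or stdlib API):

```python
def utf8_truncate(s: str, max_bytes: int) -> str:
    """Truncate s so its UTF-8 encoding fits in max_bytes
    without splitting a multibyte character."""
    b = s.encode('utf-8')[:max_bytes]
    # Cutting mid-character leaves an invalid trailing sequence;
    # errors='ignore' discards that partial tail.
    return b.decode('utf-8', errors='ignore')

print(utf8_truncate('héllo', 2))  # 'h'  -- the 2-byte 'é' doesn't fit
print(utf8_truncate('héllo', 3))  # 'hé'
```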
> > If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters. I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? > So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution. Well, I outlined a solution: work with `bytes` arrays with utilities to convert to/from the Unicode-aware string dtypes (or `object`). A UTF-8-specific dtype and maybe a string-specialized `object` dtype address the very real and consequential problems that I face (namely and respectively, working with HDF5 and in-memory manipulation of string datasets). I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have- been-warned-you're-gonna-get-mojibake option. It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
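
Both halves of this disagreement -- the lossless byte round-trip Chris relies on, and the mojibake Robert warns about -- show up in a few lines of plain Python:

```python
# UTF-8 bytes for 'é' misread as latin-1: no error, lossless
# round-trip -- but what you *see* is mojibake.
raw = 'é'.encode('utf-8')               # b'\xc3\xa9'
text = raw.decode('latin-1')            # 'Ã©' (garbage, but no exception)
assert text.encode('latin-1') == raw    # bytes survive the round trip
print(text)
```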
URL: From chris.barker at noaa.gov Mon Apr 24 17:00:13 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 14:00:13 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: > > I agree -- it is a VERY common case for scientific data sets. But a > one-byte-per-char encoding would handle it nicely, or UCS-4 if you want > Unicode. The wasted space is not that big a deal with short strings... > > Unless if you have hundreds of billions of them. > Which is why a one-byte-per char encoding is a good idea. Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) > I agree-- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to so it? Do numpy arrays HAVE to be storing utf-8 natively? > or leave it be until someone else is willing to solve that problem. I > don't think we're at the bikeshedding stage yet; we're still disagreeing > about fundamental requirements. > yeah -- though I've seen projects get stuck in the sorting out what to do, so nothing gets done stage before -- I don't want Julian to get too frustrated and end up doing nothing. So here I'll lay out what I think are the fundamental requirements: 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do: arr = np.array(("this", "that",)) you get an array that can store ANY unicode string with 4 or less characters and arr[1] will return a native Python string object. 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not be wasting space for "typical european-oriented data". arr = np.array(("this", "that",), dtype=np.single_byte_string) (name TBD) and arr[1] would return a python string. 
attempting to put in a not-compatible with the encoding string in would raise an Encoding Error. I highly recommend that (SO 8859-15 ( latin-9 or latin-1) be the encoding in this case. 3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returns a bytes object. - you could use astype() to convert between bytes and a specified encoding with no change in binary representation. 2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes. 1) could be covered with the existing 'U': type - only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features. > +1. The key point is that there is a HUGE amount of legacy science data > in the form of FITS (astronomy-specific binary file format that has been > the primary file format for 20+ years) and HDF5 which uses a character data > type to store data which can be bytes 0-255. Getting an decoding/encoding > error when trying to deal with these datasets is a non-starter from my > perspective. That says to me that these are properly represented by `bytes` objects, not > `unicode/str` objects encoding to and decoding from a hardcoded latin-1 > encoding. Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAVE to be bytes" bit Python3 is the butt for a long time. 
Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogate_escape, and other features that facilitate working with messy almost text. Practicality beats purity -- if you have one-byte per char data that is mostly european, than latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error. > - round-tripping of binary data (at least with Python's > encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and > re-encoded to get the same bytes back. You may get garbage, but you won't > get an EncodingError. > But what if the format I'm working with specifies another encoding? Am I > supposed to encode all of my Unicode strings in the specified encoding, > then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a > really important use case for me. latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too. As for external data in utf-8 -- yes that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array. utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings. Doing in-memory assignments to a fixed-encoding, fixed-width string dtype > will always have this kind of problem. You should only put up with it if > you have a requirement to write to a format that specifies the width and > the encoding. That specified encoding is frequently not latin-1! 
> of course not -- if you are writing to a format that specifies a width and the encoding, you want o use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for. > - round-tripping of binary data (at least with Python's > encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and > re-encoded to get the same bytes back. You may get garbage, but you won't > get an EncodingError. > > But what if the format I'm working with specifies another encoding? Am I > supposed to encode all of my Unicode strings in the specified encoding, > then decode as latin-1 to assign into my array? of course not -- see above. I'm happy to consider a latin-1-specific dtype as a second, > workaround-for-specific-applications-only-you-have-been- > warned-you're-gonna-get-mojibake option. well, it wouldn't create mojibake - anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless is came from already corrupted data. but when you have corrupted data, your only choices are to: - raise an error - alter the data (error-"replace") - pass the corrupted data on through. but it could deal with mojibake -- that's the whole point :-) > It should not be *the* Unicode string dtype (i.e. named np.realstring or > np.unicode as in the original proposal). God no -- sorry if it looked like I was suggesting that. I only suggest that it might be *the* one-byte-per-char string type -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
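
Chris's requirement 2 can be sketched with existing pieces; `to_onebyte` below is a hypothetical illustration, not a proposed API -- it just shows that `str.encode` already provides the raise-on-unencodable behavior he asks for:

```python
import numpy as np

def to_onebyte(strings, width, encoding='latin-1'):
    """Hypothetical sketch: store Python strings one byte per char
    in an 'S' array, raising if a string can't be encoded."""
    # str.encode raises UnicodeEncodeError for characters outside
    # the target encoding -- the "Encoding Error" in requirement 2.
    return np.array([s.encode(encoding) for s in strings],
                    dtype='S%d' % width)

arr = to_onebyte(['naïve', 'café'], width=5)
print(arr[0])          # b'na\xefve' -- one byte per char, accents intact

try:
    to_onebyte(['日本'], width=2)
except UnicodeEncodeError:
    print('rejected: not representable in latin-1')
```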
URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 19:06:56 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 19:06:56 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: > I am not unfamiliar with this problem. I still work with files that have > fields that are supposed to be in EBCDIC but actually contain text in > ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit > encodings. In that experience, I have found that just treating the data as > latin-1 unconditionally is not a pragmatic solution. It's really easy to > implement, and you do get a program that runs without raising an exception > (at the I/O boundary at least), but you don't often get a program that > really runs correctly or treats the data properly. > > Can you walk us through the problems that you are having with working with > these columns as arrays of `bytes`? > This is very simple and obvious but I will state for the record. Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat. In [22]: dat = np.array(['yes', 'no'], dtype='S3') In [23]: dat == 'yes' # FAIL (but works just fine in Py2) Out[23]: False In [24]: dat == b'yes' # Right answer but not practical Out[24]: array([ True, False], dtype=bool) - Tom [1]: Using h5py or pytables. Same with FITS, although astropy.io.fits does some tricks under the hood to auto-convert to `U` type as needed. 
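
One workaround that exists today -- at exactly the 4x memory cost Tom mentions -- is `np.char.decode`, which converts the bytes array to `U` so comparisons against string literals work again:

```python
import numpy as np

dat = np.array(['yes', 'no'], dtype='S3')

# Decoding to a 'U' array restores natural Py3 comparisons,
# but allocates 4 bytes per character.
print(np.char.decode(dat, 'ascii') == 'yes')   # [ True False]
```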
> > > > So I would beg to actually move forward with a pragmatic solution that > addresses very real and consequential problems that we face instead of > waiting/praying for a perfect solution. > > Well, I outlined a solution: work with `bytes` arrays with utilities to > convert to/from the Unicode-aware string dtypes (or `object`). > > A UTF-8-specific dtype and maybe a string-specialized `object` dtype > address the very real and consequential problems that I face (namely and > respectively, working with HDF5 and in-memory manipulation of string > datasets). > > I'm happy to consider a latin-1-specific dtype as a second, > workaround-for-specific-applications-only-you-have-been- > warned-you're-gonna-get-mojibake option. It should not be *the* Unicode > string dtype (i.e. named np.realstring or np.unicode as in the original > proposal). > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 19:08:50 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 16:08:50 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating. On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: >> Solve the HDF5 problem (i.e. 
fixed-length UTF-8 strings) > > I agree-- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to so it? Do numpy arrays HAVE to be storing utf-8 natively? If the point is to have an array that transparently accepts/yields `unicode/str` scalars while maintaining the in-memory encoding, yes. If that's not the point, then IMO the status quo is fine, and *no* new dtypes should be added, just maybe some utility functions to convert between the bytes-ish arrays and the Unicode-holding arrays (which was one of my proposals). I am mostly happy to live in a world where I read in data as bytes-ish arrays, decode into `object` arrays holding `unicode/str` objects, do my manipulations, then encode the array into a bytes-ish array to give to the C API or file format. >> or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements. > > yeah -- though I've seen projects get stuck in the sorting out what to do, so nothing gets done stage before -- I don't want Julian to get too frustrated and end up doing nothing. I understand, but not all tedious discussions that have not yet achieved consensus are bikeshedding to be cut short. We couldn't really decide what to do back in the pre-1.0 days, too, so we just did *something*, and that something is now the very situation that Julian has a problem with. We have more experience now, especially with the added wrinkles of Python 3; other projects have advanced and matured their Unicode string array-handling (e.g. pandas and HDF5); now is a great time to have a real discussion about what we *need* before we make decisions about what we should *do*. > So here I'll lay out what I think are the fundamental requirements: > > 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. 
fully unicode supporting, and with a character oriented interface. i.e. if you do: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode string with 4 or less characters > > and arr[1] will return a native Python string object. > > 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not be wasting space for "typical european-oriented data". > > arr = np.array(("this", "that",), dtype=np.single_byte_string) > > (name TBD) > > and arr[1] would return a python string. > > attempting to put in a not-compatible with the encoding string in would raise an Encoding Error. > > I highly recommend that (SO 8859-15 ( latin-9 or latin-1) be the encoding in this case. > > 3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) > > 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returns a bytes object. > - you could use astype() to convert between bytes and a specified encoding with no change in binary representation. You'll need to specify what NULL-terminating behavior you want here. np.string_ has NULL-termination. np.void (which could be made to work better with `bytes`) does not. Both have use-cases for text encoding (shakes fist at UTF-16). > 2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes. > > 1) could be covered with the existing 'U': type - only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features. > >> > +1. 
The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. > >> That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. > > Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAVE to be bytes" bit Python3 is the butt for a long time. Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogate_escape, and other features that facilitate working with messy almost text. Walk me through a problem that you've encountered with such textish data in arrays. I know the problems in Web protocol-land, but they are not really relevant to us. What are *your* problems? Why didn't those ameliorations that were added for the Web world address your problems? I really want to get at specific use cases that interact with numpy, not handwaving at problems other people have had in other contexts. > Practicality beats purity -- if you have one-byte per char data that is mostly european, than latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error. > >> > - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. 
>> But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me. > > latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too. In my experience, it has both upside and downside. Silently creating mojibake is a problem. The process that you described, decoding ANY strings of bytes as latin-1, can create mojibake. The inverse, encoding then decoding, may not, but of course the encoding step there does not accept arbitrary Unicode strings. > As for external data in utf-8 -- yes that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array. > > utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings. > >> Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1! > > of course not -- if you are writing to a format that specifies a width and the encoding, you want o use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for. Ah, your message was responding to Stephan who questioned why latin-1 should be the default encoding for the `unicode/str`-aware string dtype. It seemed like you were affirming that latin-1 ought to be that default. 
It seems like that is not your position, but you are defending the existence of a latin-1 dtype for specific uses. >> I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option. > > well, it wouldn't create mojibake - anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless is came from already corrupted data. but when you have corrupted data, your only choices are to: > > - raise an error > - alter the data (error-"replace") > - pass the corrupted data on through. > > but it could deal with mojibake -- that's the whole point :-) You are right that assigning a `unicode/str` object into my latin-1-dtype array would not create mojibake, but that's not the only way to fill a numpy array. In the context of my email, I was responding to a use case being floated for the latin-1 dtype that was to read existing FITS files that have fields that are text-ish: plain octets according to the file format standard, but in practice mostly ASCII with a few sparse high-bit characters typically from some unspecified iso-8859-* encoding. If that unspecified encoding wasn't latin-1, then I'm getting mojibake when I read the file (unless if, happy days, the author of the file was also using latin-1). I understand that you are proposing a latin-1 dtype in a context with other dtypes and tools that might make that use of the latin-1 dtype obsolete. However, there are others who have been proposing just a latin-1 dtype for this purpose. Let me make a counter-proposal for your latin-1 dtype (your #2) that might address your, Thomas's, and Julian's use cases: 2) We want a single-byte-per-character, NULL-terminated string dtype that can be used to represent mostly-ASCII textish data that may have some high-bit characters from some 8-bit encoding. 
It should be able to read arbitrary bytes (that is, up to the NULL-termination) and write them back out as the same bytes if unmodified. This lets us read this text from files where the encoding is unspecified (or is lying about the encoding) into `unicode/str` objects. The encoding is specified as `ascii` but the decoding/encoding is done with the `surrogateescape` option so that high-bit characters are faithfully represented in the `unicode/str` string but are not erroneously reinterpreted as other characters from an arbitrary encoding. I'd even be happy if Julian or someone wants to go ahead and implement this right now and leave the UTF-8 dtype for a later time. As long as this ASCII-surrogateescape dtype is not called np.realstring (it's *really* important to me that the bikeshed not be this color). ;-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Apr 24 19:09:48 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 16:09:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: > On the other hand, if this is the use-case, perhaps we really want an >> encoding closer to "Python 2" string, i.e, "unknown", to let this be >> signaled more explicitly. I would suggest that "text[unknown]" should >> support operations like a string if it can be decoded as ASCII, and >> otherwise error. But unlike "text[ascii]", it will let you store arbitrary >> bytes. >> > > I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really > is ascii, then it's perfect. If it really is latin-*, then you get some > extra useful stuff, and if it's corrupted somehow, you still get the ascii > text correct, and the rest won't barf and can be passed on through. 
> I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters." My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data? I don't think that silently (mis)interpreting non-ASCII characters as latin-1/9 is a good idea, which is why I think it would be a mistake to use 'latin-1' for text data with unknown encoding. I could get behind a data type that compares equal to strings for ASCII only and allows for *storing* other characters, but making blind assumptions about characters 128-255 seems like a recipe for disaster. Imagine text[unknown] as a one character string type, but it supports .decode() like bytes and every character in the range 128-255 compares for equality with other characters like NaN -- not even equal to itself. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 19:11:25 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 16:11:25 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. 
It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. >> >> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? > > This is very simple and obvious but I will state for the record. I appreciate it. What is obvious to you is not obvious to me. > Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat. > > In [22]: dat = np.array(['yes', 'no'], dtype='S3') > > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) > Out[23]: False > > In [24]: dat == b'yes' # Right answer but not practical > Out[24]: array([ True, False], dtype=bool) I'm curious why you think this is not practical. It seems like a very practical solution to me. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Apr 24 19:19:16 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 16:19:16 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Let me make a counter-proposal for your latin-1 dtype (your #2) that might > address your, Thomas's, and Julian's use cases: > > 2) We want a single-byte-per-character, NULL-terminated string dtype that > can be used to represent mostly-ASCII textish data that may have some > high-bit characters from some 8-bit encoding. 
It should be able to read > arbitrary bytes (that is, up to the NULL-termination) and write them back > out as the same bytes if unmodified. This lets us read this text from files > where the encoding is unspecified (or is lying about the encoding) into > `unicode/str` objects. The encoding is specified as `ascii` but the > decoding/encoding is done with the `surrogateescape` option so that > high-bit characters are faithfully represented in the `unicode/str` string > but are not erroneously reinterpreted as other characters from an arbitrary > encoding. > > I'd even be happy if Julian or someone wants to go ahead and implement > this right now and leave the UTF-8 dtype for a later time. > > As long as this ASCII-surrogateescape dtype is not called np.realstring > (it's *really* important to me that the bikeshed not be this color). ;-) > This sounds quite similar to my text[unknown] proposal, with the advantage that the concept of "surrogateescape" that already exists. Surrogate-escape characters compare equal to themselves, which is maybe less than ideal, but it looks like you can put them in real unicode strings, which is nice. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 19:23:37 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 16:23:37 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: >>> >>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes. 
>> >> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through. > > I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters." > > My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data? I don't know that we can reasonably make that accounting relevant. Number of such characters per byte of text? Number of files with such characters out of all existing files? What I can say with assurance is that every time I have decided, as a developer, to write code that just hardcodes latin-1 for such cases, I have regretted it. While it's just personal anecdote, I think it's at least measuring the right thing. :-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 20:56:43 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 20:56:43 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < > aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern > wrote: > >> > >> I am not unfamiliar with this problem. I still work with files that > have fields that are supposed to be in EBCDIC but actually contain text in > ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit > encodings. 
In that experience, I have found that just treating the data as > latin-1 unconditionally is not a pragmatic solution. It's really easy to > implement, and you do get a program that runs without raising an exception > (at the I/O boundary at least), but you don't often get a program that > really runs correctly or treats the data properly. > >> > >> Can you walk us through the problems that you are having with working > with these columns as arrays of `bytes`? > > > > This is very simple and obvious but I will state for the record. > > I appreciate it. What is obvious to you is not obvious to me. > > > Reading an HDF5 file with character data currently gives arrays of > `bytes` [1]. In Py3 this cannot be compared to a string literal, and > comparing to (or assigning from) explicit byte strings everywhere in the > code quickly spins out of control. This generally forces one to convert > the data to `U` type and incur the 4x memory bloat. > > > > In [22]: dat = np.array(['yes', 'no'], dtype='S3') > > > > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) > > Out[23]: False > > > > In [24]: dat == b'yes' # Right answer but not practical > > Out[24]: array([ True, False], dtype=bool) > > I'm curious why you think this is not practical. It seems like a very > practical solution to me. > In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc) you need to explicitly coerce the right hand side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable. The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate. 
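The trade-off being described is easy to demonstrate with stock NumPy -- a minimal sketch showing both the bytes-comparison behaviour and the 4x storage cost of the `U` workaround:

```python
import numpy as np

# The Py3 pain point: a bytes ('S') array only compares cleanly
# against bytes, so every interaction needs an explicit b'...'.
dat = np.array(['yes', 'no'], dtype='S3')
print(dat == b'yes')       # [ True False]

# The common workaround is a unicode sandwich: convert to the
# 4-bytes-per-character 'U' dtype at the interface...
udat = dat.astype('U3')
print(udat == 'yes')       # [ True False]

# ...at 4x the memory cost:
print(dat.dtype.itemsize, udat.dtype.itemsize)   # 3 12
```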
- Tom > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 21:36:46 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 18:36:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. >> >> >> >> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? >> > >> > This is very simple and obvious but I will state for the record. >> >> I appreciate it. What is obvious to you is not obvious to me. >> >> > Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. 
This generally forces one to convert the data to `U` type and incur the 4x memory bloat. >> > >> > In [22]: dat = np.array(['yes', 'no'], dtype='S3') >> > >> > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) >> > Out[23]: False >> > >> > In [24]: dat == b'yes' # Right answer but not practical >> > Out[24]: array([ True, False], dtype=bool) >> >> I'm curious why you think this is not practical. It seems like a very practical solution to me. > > In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc) you need to explicitly coerce the right hand side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable. Okay, so the problem isn't with (byte-)string literals, but with variables being passed around from other sources. Eg. def func(dat, scalar): return dat == scalar Every one of those functions deepens the abstraction and moves that unicode-by-default scalar farther away from the bytesish array, so it's harder to demand that users of those functions be aware that they need to pass in `bytes` strings. So you need to implement those functions defensively, which complicates them. > The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate. What do you think about my ASCII-surrogateescape proposal? Do you think that would work with your use cases? In general, I don't think Unicode sandwiches will be eliminated by this or the latin-1 dtype; the sandwich is usually the right thing to do and the surrogateescape the wrong thing. But I'm keenly aware of the problems you get when there just isn't a reliable encoding to use. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Mon Apr 24 22:07:23 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 24 Apr 2017 19:07:23 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 21, 2017 2:34 PM, "Stephan Hoyer" wrote: I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. You may already know this, but probably not everyone reading does: the reason why latin1 often gets special attention in discussions of Unicode encoding is that latin1 is effectively "ucs1". It's the unique one byte text encoding where byte N represents codepoint U+N. I can't think of any reason why this property is particularly important for numpy's usage, because we always have a conversion step anyway to get data in and out of an array. The potential arguments for latin1 that I can think of are: - if we have to implement our own en/decoding code for some reason then it's the most trivial encoding - if other formats standardize on latin1-with-nul-padding and we want in-memory/mmap compatibility - if we really want a fixed width encoding for some reason but don't care which one, then it's in some sense the most obvious choice I can't think of many reasons why having a fixed width encoding is particularly important though... For our current style of string storage, even calculating the length of a string is O(n), and AFAICT the only way to actually take advantage of the theoretical O(1) character indexing is to make a uint8 view. I guess it would be useful if we had a string slicing ufunc... But why would we? 
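The "ucs1" property is easy to verify in plain Python: latin-1 is the one 8-bit encoding where byte N decodes to codepoint U+N, which is also why it can round-trip arbitrary bytes without raising:

```python
# latin-1 as "ucs1": byte N decodes to codepoint U+N, for all 256 bytes
data = bytes(range(256))
text = data.decode('latin-1')
assert all(ord(ch) == b for ch, b in zip(text, data))

# ...so any byte string survives a decode/encode round-trip unchanged
assert text.encode('latin-1') == data
```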
That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically. From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 22:23:55 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 19:23:55 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically. > > From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't. Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones.
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Apr 24 22:41:31 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 24 Apr 2017 19:41:31 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > >> That said, AFAICT what people actually want in most use cases is support >> for arrays that can hold variable-length strings, and the only place where >> the current approach is *optimal* is when we need mmap compatibility with >> legacy formats that use fixed-width-nul-padded fields (at which point it's >> super convenient). It's not even possible to *represent* all Python strings >> or bytestrings in current numpy unicode or string arrays (Python >> strings/bytestrings can have trailing nuls). So if we're talking about >> tweaks to the current system it probably makes sense to focus on this use >> case specifically. >> >> From context I'm assuming FITS files use fixed-width-nul-padding for >> strings? Is that right? I know HDF5 doesn't. > > Yes, HDF5 does. Or at least, it is supported in addition to the > variable-length ones. > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html Doh, I found that page but it was (and is) meaningless to me, so I went by http://docs.h5py.org/en/latest/strings.html, which says the options are fixed-width ascii, variable-length ascii, or variable-length utf-8 ... I guess it's just talking about what h5py currently supports. But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-). 
Is it important for some other reason? Also, further searching suggests that HDF5 actually supports all of nul termination, nul padding, and space padding, and that nul termination is the default? How much does it help to have in-memory compatibility with just one of these options (and not even the default one)? Would we need to add the other options to be really useful for HDF5? (Unlikely to happen within numpy itself, but potentially something that could be done inside h5py or whatever if numpy's user-defined dtype system were a little more useful.) -n [1] hope -- Nathaniel J. Smith -- https://vorpus.org From shoyer at gmail.com Mon Apr 24 23:01:48 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 20:01:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Of course they do :) https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60d5f8/pyfive/low_level.py#L682 > Also, further searching suggests that HDF5 actually supports all of > nul termination, nul padding, and space padding, and that nul > termination is the default? How much does it help to have in-memory > compatibility with just one of these options (and not even the default > one)? Would we need to add the other options to be really useful for > HDF5? h5py actually ignores this option and only uses null termination. I have not heard any complaints about this (though I have heard complaints about the lack of fixed-length UTF-8). But more generally, you're right. h5py doesn't need a corresponding NumPy dtype for each HDF5 string dtype, though that would certainly be *convenient*. 
In fact, it already (ab)uses NumPy's dtype metadata with h5py.special_dtype to indicate a homogeneous string type for object arrays. I would guess h5py users have the same needs for efficient string representations (including surrogate-escape options) as other scientific users. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 23:07:33 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 20:07:33 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > > On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > > > >> That said, AFAICT what people actually want in most use cases is support > >> for arrays that can hold variable-length strings, and the only place where > >> the current approach is *optimal* is when we need mmap compatibility with > >> legacy formats that use fixed-width-nul-padded fields (at which point it's > >> super convenient). It's not even possible to *represent* all Python strings > >> or bytestrings in current numpy unicode or string arrays (Python > >> strings/bytestrings can have trailing nuls). So if we're talking about > >> tweaks to the current system it probably makes sense to focus on this use > >> case specifically. > >> > >> From context I'm assuming FITS files use fixed-width-nul-padding for > >> strings? Is that right? I know HDF5 doesn't. > > > > Yes, HDF5 does. Or at least, it is supported in addition to the > > variable-length ones. 
> > > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > Doh, I found that page but it was (and is) meaningless to me, so I > went by http://docs.h5py.org/en/latest/strings.html, which says the > options are fixed-width ascii, variable-length ascii, or > variable-length utf-8 ... I guess it's just talking about what h5py > currently supports. It's okay, I made exactly the same mistake earlier in the thread. :-) > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Is it important for some other reason? The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk-space because they can't be compressed (even accounting for the wasted padding space). I mean, they probably could have implemented it with objects arrays like h5py does with the variable-length string Datasets, but they didn't. https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/624 -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 12:01:05 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:01:05 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Chris, you've mashed all of my emails together, some of them are in reply > to you, some in reply to others. Unfortunately, this dropped a lot of the > context from each of them, and appears to be creating some > misunderstandings about what each person is advocating. 
> Sorry about that -- I was trying to keep an already really long thread from getting even longer.... And I'm not sure it matters who's doing the advocating, but rather *what* is being advocated -- I hope I didn't screw that up too badly. Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. So I'll try again -- use-case only! we'll keep the possible solutions separate. Do we need to write up a NEP for this? it seems we are going a bit in circles, and we really do want to capture the final decision process. 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:: arr = np.array(("this", "that",)) you get an array that can store ANY unicode string with 4 or less characters. and arr[1] will return a native Python3 string object. This is the use-case for "casual" numpy users -- not the folks writing H5py and the like, or the ones writing Cython bindings to C++ libs. 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not to be wasting space for "typical european-language-oriented data". Note: this should ALSO be compatible with Python's character-oriented string model. i.e. a Python String with length N will fit into a dtype of size N. arr = np.array(("this", "that",), dtype=np.single_byte_string) (name TBD) and arr[1] would return a python string. attempting to put in a String not compatible with the encoding would raise an EncodingError. This is also a use-case primarily for "casual" users -- but ones concerned with the size of the data storage and know they are using european text. 3) dtypes that support storage in particular encodings: Python strings would be encoded appropriately when put into the array.
A Python string would be returned when indexing. a) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) at the binary level. b) There be a dtype that could store data in any encoding supported by Python -- to facilitate bytes-level interchange with other systems. If we need more than utf-8, then we might as well have the full set. 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object (or other memoryview?), and returns a bytes object. You could use astype() to convert between bytes and a specified encoding with no change in binary representation. This could be used to store any binary data, including encoded text or anything else. this should map directly to the Python bytes model -- thus NOT null-terminted. This is a little different than 'S' behaviour on py3 -- it appears that with 'S', a if ALL the trailing bytes are null, then it is truncated, but if there is a null byte in the middle, then it is preserved. I suspect that this is a legacy from Py2's use of "strings" as both text and binary data. But in py3, a "bytes" type should be about bytes, and not text, and thus null-values bytes are simply another value a byte can hold. There are multiple ways to address these use cases -- please try to make your comments clear about whether you think the use-case is unimportant, or ill-defined, or if you think a given solution is a poor choice. To facilitate that, I will put my comments on possible solutions in a separate note, too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Tue Apr 25 12:34:46 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:34:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: This is essentially my rant about use-case (2): A compact dtype for mostly-ascii text: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker > wrote: > >> On the other hand, if this is the use-case, perhaps we really want an >>> encoding closer to "Python 2" string, i.e, "unknown", to let this be >>> signaled more explicitly. I would suggest that "text[unknown]" should >>> support operations like a string if it can be decoded as ASCII, and >>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary >>> bytes. >>> >> >> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it >> really is ascii, then it's perfect. If it really is latin-*, then you get >> some extra useful stuff, and if it's corrupted somehow, you still get the >> ascii text correct, and the rest won't barf and can be passed on through. >> > > I am totally in agreement with Thomas that "We are living in a messy > world right now with messy legacy datasets that have character type data > that are *mostly* ASCII, but not infrequently contain non-ASCII characters." > > My question: What are those non-ASCII characters? How often are they truly > latin-1/9 vs. some other text encoding vs. non-string binary data? > I am totally euro-centric, but as I understand it, that is the whole point of the desire for a compact one-byte-per character encoding. If there is a strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we should support that. But this all started with "mostly ascii". 
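Concretely, here is how the candidate decodings treat a mostly-ASCII byte string with a single stray high-bit byte (0xE9, which is 'é' only if the source really was latin-1):

```python
raw = b'caf\xe9'  # mostly ASCII, one stray high-bit byte

# errors='replace': the byte is destroyed (replaced by U+FFFD)
assert raw.decode('ascii', 'replace') == 'caf\ufffd'

# errors='surrogateescape': the byte is smuggled into a lone surrogate;
# it round-trips exactly, but is not printable as text
s = raw.decode('ascii', 'surrogateescape')
assert s == 'caf\udce9'
assert s.encode('ascii', 'surrogateescape') == raw

# latin-1: readable text that also round-trips -- but it is only
# *correct* if the data really was latin-1
assert raw.decode('latin-1') == 'café'
```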
My take on that is: We don't want to use pure-ASCII -- that is the hell that python2's default encoding approach led to -- it is MUCH better to pass garbage through than crash out with an EncodingError -- data are messy, and people are really bad at writing comprehensive tests. So we need something that handles ASCII properly, and can pass through arbitrary bytes as well without crashing. Options are: * ASCII with errors='ignore' or 'replace' I think that is a very bad idea -- it is tossing away information that _may_ have some use elsewhere:: s = arr[i] arr[i] = s should put the same bytes back into the array. * ASCII with errors='surrogateescape' This would preserve bytes and not crash out, so meets the key criteria. * latin-1 This would do exactly the correct thing for ASCII, preserve the bytes, and not crash out. But it would also allow additional symbols useful to european languages and scientific computing. Seems like a win-win to me. As for my use-cases: - Messy data: I have had a lot of data sets with european text in them, mostly ASCII and an occasional non ASCII accented character or symbol -- most of these come from legacy systems, and have an ugly arbitrary combination of MacRoman, Win-something-or-other, and who knows what -- i.e. mojibake, though at least mostly ascii. The only way to deal with it "properly" is to examine each string and try to figure out which encoding it is in, hope at least a single string is in one encoding, and then decode/encode it properly. So numpy should support that -- which would be handled by a 'bytes' type, just like in Python itself. But sometimes that isn't practical, and still doesn't work 100% -- in which case, we can go with latin-1, and there will be some weird, incorrect characters in there, and that is OK -- we fix them later when QA/QC or users notice it -- really just like a typo. But stripping the non-ascii characters out would be a worse solution. As would "replace", as sometimes it IS the correct symbol!
(european encodings aren't totally incompatible...). And surrogateescape is worse, too -- any "weird" character is the same to my users, and at least sometimes it will be the right character -- however surrogateescape gets printed, it will never look right. (and can it even be handled by a non-python system?) - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this. Granted, I should probably simply use a proper unicode type for filenames anyway, but sometimes the data comes in already encoded as latin-something. In the end I still see no downside to latin-1 over ascii-only -- only an upside. I don't think that silently (mis)interpreting non-ASCII characters as > latin-1/9 is a good idea, which is why I think it would be a mistake to use > 'latin-1' for text data with unknown encoding. > if it's totally unknown, then yes -- but for totally unknown, bytes is the only reasonable option -- then run chardet or something over it. But "some latin encoding" -- latin-1 is a good choice. I could get behind a data type that compares equal to strings for ASCII > only and allows for *storing* other characters, but making blind > assumptions about characters 128-255 seems like a recipe for disaster. > Imagine text[unknown] as a one character string type, but it supports > .decode() like bytes and every character in the range 128-255 compares for > equality with other characters like NaN -- not even equal to itself. > would this be ascii with surrogateescape?
-- almost, though I think the surrogateescapes would compare equal if they were equal -- which, now that I think about it would be what you want -- why preserve the bytes if they aren't an important part of the data? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 12:45:20 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:45:20 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern wrote: > > My question: What are those non-ASCII characters? How often are they > truly latin-1/9 vs. some other text encoding vs. non-string binary data? > > I don't know that we can reasonably make that accounting relevant. Number > of such characters per byte of text? Number of files with such characters > out of all existing files? > I have a lot of mostly-english text -- usually not latin-1, but usually mostly latin-1 -- the non-ascii characters are a handful of accented characters (usually from spanish, some french), then a few "scientific" characters: the degree symbol, the "micro" symbol. I suspect that this is not an unusual pattern for mostly-english scientific text. If it's non-string binary data, I know it -- and I'd use a bytes type. I have two options -- try to detect the encoding properly or use _something_ and fix it up later. latin-1 is a great choice for the latter option -- most of the text displays fine, and the wrong stuff is untouched, so I can figure it out.
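To make the "fix it up later" trade-off concrete, here is a minimal sketch; the byte string is invented for illustration (mostly ascii, with 0xb5 -- the micro sign in latin-1/latin-9 -- as the one stray high-bit byte):

```python
# Invented sample: mostly-ascii text with one high-bit byte (0xb5,
# the micro sign in latin-1/latin-9).
raw = b"5 \xb5m sample"

# Strict ascii refuses the high-bit byte outright -- the py3 "crash":
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    pass

# latin-1 never fails, displays the ascii part correctly, and
# round-trips the original bytes exactly:
text = raw.decode("latin-1")
assert text == "5 \xb5m sample"       # '\xb5' is the micro sign
assert text.encode("latin-1") == raw  # same bytes back

# ascii + surrogateescape also round-trips, but the stray byte comes
# back as an unprintable lone surrogate instead of a readable symbol:
escaped = raw.decode("ascii", errors="surrogateescape")
assert escaped == "5 \udcb5m sample"
assert escaped.encode("ascii", errors="surrogateescape") == raw
```

Either handler satisfies the "s = arr[i]; arr[i] = s puts the same bytes back" criterion above; they differ only in how the stray byte displays.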
What I can say with assurance is that every time I have decided, as a > developer, to write code that just hardcodes latin-1 for such cases, I have > regretted it. While it's just personal anecdote, I think it's at least > measuring the right thing. :-) > I've had the opposite experience -- so that's two anecdotes :-) If it were, say, shift-jis, then yes using latin-1 would be a bad idea. But not really much worse than any other option other than properly decoding it. In a way, using latin-1 is like the old py2 string -- it can be used as text, even if it has arbitrary non-text garbage in it... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 12:52:06 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:52:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: OK -- onto proposals: 1) The default behaviour for numpy arrays of strings is compatible with > Python3's string model: i.e. fully unicode supporting, and with a character > oriented interface. i.e. if you do:: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode string with 4 or less > characters. > > and arr[1] will return a native Python3 string object. > > This is the use-case for "casual" numpy users -- not the folks writing > H5py and the like, or the ones writing Cython bindings to C++ libs. > I see two options here: a) The current 'U' dtype -- fully meets the specs, and is already there. b) Having a pointer-to-a-python string dtype: -I take it that's what Pandas does and people seem happy.
-That would get us variable length strings, and potentially other nifty string-processing. - It would lose the ability to interact at the binary level with other systems -- but do any other systems use UCS-4 anyway? - how would it work with pickle and numpy zip storage? Personally, I'm fine with (a), but (b) seems like it could be a nice addition. As the 'U' type already exists, the choice to add a python-string type is really orthogonal to the rest of this discussion. Note that I think using utf-8 internally to fit his need is a mistake -- it does not match well with the Python string model. That's it for use-case (1) -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From ambrose.li at gmail.com Tue Apr 25 12:57:02 2017 From: ambrose.li at gmail.com (Ambrose LI) Date: Tue, 25 Apr 2017 12:57:02 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: 2017-04-25 12:34 GMT-04:00 Chris Barker : > I am totally euro-centric, but as I understand it, that is the whole point > of the desire for a compact one-byte-per character encoding. If there is a > strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we > should support that. But this all started with "mostly ascii". My take on > that is: But Shift-JIS is not one-byte; it's two-byte (unless you allow only half-width characters and nothing else). :-) In fact legacy CJK encodings are all nominally two-byte (so that the width of a character's internal representation matches that of its visual representation). 
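For what it's worth, Python's codecs make Ambrose's point easy to check (the sample characters here are my own, not from the thread):

```python
# Shift-JIS is one byte per character only for ascii and for the
# half-width katakana range (bytes 0xa1-0xdf); ordinary kanji and
# full-width kana take two bytes each.
assert len("A".encode("shift_jis")) == 1        # ascii
assert len("\uff71".encode("shift_jis")) == 1   # half-width katakana 'ｱ'
assert len("\u6f22".encode("shift_jis")) == 2   # kanji '漢'
```

So a fixed one-byte-per-character dtype simply cannot represent general Shift-JIS text, which is why it is a bad example of a "1-byte encoding".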
> - filenames > > File names are one of the key reasons folks struggled with the python3 data > model (particularly on *nix) and why 'surrogateescape' was added. It's > pretty common to store filenames in with our data, and thus in numpy arrays > -- we need to preserve them exactly and display them mostly right. Again, > euro-centric, but if you are euro-centric, then latin-1 is a good choice for > this. This I don't understand. As far as I can tell non-Western-European filenames are not unusual. If filenames are a reason, even if you're euro-centric (think Eastern Europe, say) I don't see how latin1 is a good choice. Lurker here, and I haven't touched numpy in ages. So I might be blurting out nonsense. -- Ambrose Li // http://o.gniw.ca / http://gniw.ca If you saw this on CE-L: You do not need my permission to quote me, only proper attribution. Always cite your sources, even if you have to anonymize and/or cite it as "personal communication". From chris.barker at noaa.gov Tue Apr 25 13:04:53 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 10:04:53 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: > 2017-04-25 12:34 GMT-04:00 Chris Barker : > > I am totally euro-centric, > > But Shift-JIS is not one-byte; it's two-byte (unless you allow only > half-width characters and nothing else). :-) Bad example then -- are there other non-euro-centric one byte per char encodings worth worrying about? I have no clue :-) > This I don't understand. As far as I can tell non-Western-European > filenames are not unusual. If filenames are a reason, even if you're > euro-centric (think Eastern Europe, say) I don't see how latin1 is a > good choice. > right -- this is the age of Unicode -- Unicode is the correct choice.
But many of us have data in old files that are not proper Unicode -- and that includes filenames. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Apr 25 13:07:55 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 10:07:55 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. > > So I'll try again -- use-case only! we'll keep the possible solutions separate. > > Do we need to write up a NEP for this? it seems we are going a bit in circles, and we really do want to capture the final decision process. > > 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:: ... etc. These aren't use cases but rather requirements. I'm looking for something rather more concrete than that. * HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. 
Compression is supported for fixed-length string arrays but not variable-length string arrays. * FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecific encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small. * pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`. * There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform. * In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 13:02:17 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 10:02:17 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: Now my proposal for the other use cases: 2) There be some way to store mostly ascii-compatible strings in a single > byte-per-character array -- so not to be wasting space for "typical > european-language-oriented data". 
Note: this should ALSO be compatible with > Python's character-oriented string model. i.e. a Python String with length > N will fit into a dtype of size N. > > arr = np.array(("this", "that",), dtype=np.single_byte_string) > > (name TBD) > > and arr[1] would return a python string. > > attempting to put in a String that is not compatible with the encoding would > raise an EncodingError. > > This is also a use-case primarily for "casual" users -- but ones concerned > with the size of the data storage and who know they are using european text. > more detail elsewhere -- but either ascii with surrogateescape or latin-1 are always good options here. I prefer latin-1 (I really see no downside), but others disagree... But then we get to: > 3) dtypes that support storage in particular encodings: > We need utf-8. We may need others. We may need a 1-byte per char compact encoding that isn't close enough to ascii or latin-1 to be useful (say, shift-jis), And I don't think we are going to come to a consensus on what "single" encoding to use for 1-byte-per-char. So really -- going back to Julian's earlier proposal: a dtype with an encoding and a specified "size" in bytes; once defined, numpy would encode/decode to/from python strings "correctly". We might need "null-terminated utf-8" as a special case. That would support all the other use cases. Even the one-byte per char encoding. I'd like to see a clean alias to a latin-1 encoding, but not a big deal. That leaves a couple decisions: - error out or truncate if the passed-in string is too long? - error out or surrogateescape if there are invalid bytes in the data? - error out or something else if there are characters that can't be encoded in the specified encoding. And we still need a proper bytes type: 4) a fixed length bytes dtype -- pretty much what 'S' is now under python > three -- settable from a bytes or bytearray object (or other memoryview?), > and returns a bytes object.
> > You could use astype() to convert between bytes and a specified encoding > with no change in binary representation. This could be used to store any > binary data, including encoded text or anything else. This should map > directly to the Python bytes model -- thus NOT null-terminated. > > This is a little different from 'S' behaviour on py3 -- it appears that > with 'S', if ALL the trailing bytes are null, then it is truncated, but > if there is a null byte in the middle, then it is preserved. I suspect that > this is a legacy from Py2's use of "strings" as both text and binary data. > But in py3, a "bytes" type should be about bytes, and not text, and thus > null-valued bytes are simply another value a byte can hold. > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Tue Apr 25 13:12:54 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 17:12:54 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to nail down first -- as Robert has made > clear. > I would make my use-cases more user-specific: 1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. -> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions are going to be a memory problem.
2) User has to deal with fixed-width binary data from an external program/library and wants to see it as python strings. This may be systematically encoded in a known encoding (e.g. HDF5's fixed-storage-length zero-padded UTF-8 strings, spec-observing FITS' zero-padded ASCII) or ASCII-with-exceptions-and-the-user-is-supposed-to-know (e.g. spec-violating FITS files with zero-padded latin-9, koi8-r, cp1251, or whatever). Length may be signaled by null termination, null padding, or space padding. -> Solvable with a fixed-storage-size encoded-string dtype, as long as it has a parameter for how length is signaled. Python tricks for dealing with wrong or unknown encodings can make bogus data manageable. 3) User has to deal with fixed-width binary data from an external program/library that really is binary bytes. -> Solvable with a dtype that returns fixed-length byte strings. 4) User has a stupendous number (billions) of short strings which are mostly but not entirely ASCII and wants to manipulate them as strings. -> Not sure how to solve this. Maybe an object array with byte strings for storage and encoding information in the dtype, allowing transparent decoding? Or a fixed-storage-size array with a one-byte encoding that can cope with all the characters the user will ever want to use? 5) User has a bunch of mystery-encoding strings(?) and wants to store them in a numpy array. -> If they're python strings already, no further harm is done by treating this as case 1 when in python-land. If they need to be in fixed-width fields for communication with an external program or library, this puts us in case 2, unknown encoding variety; user will have to pick an encoding that the external program is likely to be able to cope with; this may be the one that originated the mystery strings in the first place. 6) User has python strings and wants to store them in non-object numpy arrays for some reason but doesn't care about the actual memory layout. 
-> Solvable with the current setup; fixed-width UCS-4 fields, padded with Unicode NULL. Happily, this comes for free from arbitrary-encoding fixed-storage-size dtypes, though a friendlier interface might be nice. Also allows people to use UCS-2 or ASCII if they know their strings fit. 7) User has data in one binary format and it needs to go into another, with perhaps casual inspection while in python-land. Such data is mostly ASCII but might contain mystery characters; presenting gobbledygook to the user is okay as long as the characters are output intact. -> Reading and writing as a fixed-width one-byte encoding, preferably one resembling the one the data is actually in, should work here. UTF-8 is likely to mangle the data; ASCII-with-surrogateescape might do okay. The key thing here is that both input and output files will have their own ways of specifying string length and their own storage specifiers; user must know these, and someone has to know and specify what to do with strings that don't fit. Simple truncation will mangle UTF-8 if it is not known to be UTF-8, but there's maybe not much that can be done about that. I guess my point is that a use case should specify: * Where does the data come from (i.e. in what format)? * Are there memory constraints in the storage format? * How should access look to the user? In particular, what should misencoded data look like? * Where does the data need to go? Anne -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert.kern at gmail.com Tue Apr 25 13:15:27 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 10:15:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker wrote: > > On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: >> >> 2017-04-25 12:34 GMT-04:00 Chris Barker : >> > I am totally euro-centric, > >> But Shift-JIS is not one-byte; it's two-byte (unless you allow only >> half-width characters and nothing else). :-) > > bad example then -- are their other non-euro-centric one byte per char encodings worth worrying about? I have no clue :-) I've run into Windows-1251 in files (seismic and well log data from Russian wells). Treating them as latin-1 does not make for a happy time. Both encodings also technically derive from ASCII in the lower half, but most of the actual language is written with the high-bit characters. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Tue Apr 25 13:34:37 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 17:34:37 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > * HDF5 supports fixed-length and variable-length string arrays encoded in > ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite > the documentation claiming that there are more options). In practice, the > ASCII strings permit high-bit characters, but the encoding is unspecified. > Memory-mapping is rare (but apparently possible). The two major HDF5 > bindings are waiting for a fixed-length UTF-8 numpy dtype to support that > HDF5 option. 
Compression is supported for fixed-length string arrays but > not variable-length string arrays. > > * FITS supports fixed-length string arrays that are NULL-padded. The > strings do not have a formal encoding, but in practice, they are typically > mostly ASCII characters with the occasional high-bit character from an > unspecific encoding. Memory-mapping is a common practice. These arrays can > be quite large even if each scalar is reasonably small. > > * pandas uses object arrays for flexible in-memory handling of string > columns. Lengths are not fixed, and None is used as a marker for missing > data. String columns must be written to and read from a variety of formats, > including CSV, Excel, and HDF5, some of which are Unicode-aware and work > with `unicode/str` objects instead of `bytes`. > > * There are a number of sometimes-poorly-documented, > often-poorly-adhered-to, aging file format "standards" that include string > arrays but do not specify encodings, or such specification is ignored in > practice. This can make the usual "Unicode sandwich" at the I/O boundaries > difficult to perform. > > * In Python 3 environments, `unicode/str` objects are rather more common, > and simple operations like equality comparisons no longer work between > `bytes` and `unicode/str`, making it difficult to work with numpy string > arrays that yield `bytes` scalars. > It seems the greatest challenge is interacting with binary data from other programs and libraries. If we were living entirely in our own data world, Unicode strings in object arrays would generally be pretty satisfactory. So let's try to get what is needed to read and write other people's formats. I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map directly to numpy arrays; we can store it however we want, as conversion is necessary anyway. Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. 
But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From hodge at stsci.edu Tue Apr 25 13:51:19 2017 From: hodge at stsci.edu (Phil Hodge) Date: Tue, 25 Apr 2017 13:51:19 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On 04/25/2017 01:34 PM, Anne Archibald wrote: > I know they're not numpy-compatible, but FITS header values are > space-padded; does that occur elsewhere? Strings in FITS headers are delimited by single quotes. Some keywords (only a handful) are required to have values that are blank-padded (in the FITS file) if the value is less than eight characters. Whether you get trailing blanks when you read the header depends on the FITS reader. 
I use astropy.io.fits to read/write FITS files, and that interface strips trailing blanks from character strings: TARGPROP= 'UNKNOWN ' / Proposer's name for the target >>> fd = fits.open("test.fits") >>> s = fd[0].header['targprop'] >>> len(s) 7 Phil From peridot.faceted at gmail.com Tue Apr 25 13:54:39 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 17:54:39 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 6:36 PM Chris Barker wrote: > > This is essentially my rant about use-case (2): > > A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be compatible with anything in particular? If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number so that you need a more compact, fixed-stride memory layout? Presumably you're getting byte strings (with no NULLs) from somewhere and need to store them in this memory structure in a way that makes them as usable as possible in spite of their unknown encoding. Presumably the thing to do is just copy them in there as-is and then use .astype to arrange for python to decode them when accessed. So this is precisely the problem of "how should I decode random byte strings?" that python has been struggling with. My impression is that the community has established that there's no one solution that makes everyone happy, but that most people can cope with some combination of picking a one-byte encoding, ascii-with-surrogateescapes, zapping bogus characters, and giving wrong results. But I think that all the standard python alternatives are needed, in general, and in terms of interpreting numpy arrays full of bytes. 
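For reference, the standard alternatives listed above, applied to one invented mystery byte string (mostly ascii, one bogus high-bit byte):

```python
# Invented mystery data: mostly ascii with one bogus byte (0xb0).
mystery = b"temp 25\xb0"

# zap the bogus character:
assert mystery.decode("ascii", "ignore") == "temp 25"
# mark it with U+FFFD:
assert mystery.decode("ascii", "replace") == "temp 25\ufffd"
# pick a one-byte encoding (reads as a degree sign -- right only if
# the data really was latin-1):
assert mystery.decode("latin-1") == "temp 25\xb0"
# preserve the byte for a later round-trip, at the cost of an
# unprintable lone surrogate in the decoded string:
s = mystery.decode("ascii", "surrogateescape")
assert s.encode("ascii", "surrogateescape") == mystery
```

Each option maps onto one of the coping strategies above: zapping, giving (possibly) wrong results, or deferring the decision.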
Clearly your preferred solution is .astype("string[latin-9]"), but just as clearly that's not going to work for everyone. If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays; anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Tue Apr 25 14:00:20 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 18:00:20 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge wrote: > On 04/25/2017 01:34 PM, Anne Archibald wrote: > > I know they're not numpy-compatible, but FITS header values are > > space-padded; does that occur elsewhere? > > Strings in FITS headers are delimited by single quotes. Some keywords > (only a handful) are required to have values that are blank-padded (in > the FITS file) if the value is less than eight characters. Whether you > get trailing blanks when you read the header depends on the FITS > reader. I use astropy.io.fits to read/write FITS files, and that > interface strips trailing blanks from character strings: > > TARGPROP= 'UNKNOWN ' / Proposer's name for the target > > >>> fd = fits.open("test.fits") > >>> s = fd[0].header['targprop'] > >>> len(s) > 7 > Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before they can go into a data structure. 
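As a toy illustration (plain Python, not astropy's actual parser), the blank padding inside the quotes of Phil's example card survives the de-quoting and is then stripped as insignificant per the FITS spec:

```python
# The header card from Phil's example (value padded to eight characters):
card = "TARGPROP= 'UNKNOWN ' / Proposer's name for the target"

# crude de-quoting: take the text between the first pair of single quotes
start = card.index("'") + 1
end = card.index("'", start)
padded = card[start:end]
assert padded == "UNKNOWN "  # FITS-mandated blank padding intact

# trailing spaces are not significant, so a reader strips them:
assert padded.rstrip() == "UNKNOWN"
assert len(padded.rstrip()) == 7  # matches Phil's len(s) == 7
```

The point being that by the time such a value could land in a numpy array, the fixed-width space padding has already been dealt with by the reader.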
What I was wondering was whether people have data lying around with fixed-width fields where the strings are space-padded, so that numpy needs to support that. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Apr 25 14:18:57 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 12:18:57 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald wrote: > > On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > >> * HDF5 supports fixed-length and variable-length string arrays encoded in >> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite >> the documentation claiming that there are more options). In practice, the >> ASCII strings permit high-bit characters, but the encoding is unspecified. >> Memory-mapping is rare (but apparently possible). The two major HDF5 >> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that >> HDF5 option. Compression is supported for fixed-length string arrays but >> not variable-length string arrays. >> >> * FITS supports fixed-length string arrays that are NULL-padded. The >> strings do not have a formal encoding, but in practice, they are typically >> mostly ASCII characters with the occasional high-bit character from an >> unspecific encoding. Memory-mapping is a common practice. These arrays can >> be quite large even if each scalar is reasonably small. >> >> * pandas uses object arrays for flexible in-memory handling of string >> columns. Lengths are not fixed, and None is used as a marker for missing >> data. String columns must be written to and read from a variety of formats, >> including CSV, Excel, and HDF5, some of which are Unicode-aware and work >> with `unicode/str` objects instead of `bytes`. 
>> >> * There are a number of sometimes-poorly-documented, >> often-poorly-adhered-to, aging file format "standards" that include string >> arrays but do not specify encodings, or such specification is ignored in >> practice. This can make the usual "Unicode sandwich" at the I/O boundaries >> difficult to perform. >> >> * In Python 3 environments, `unicode/str` objects are rather more common, >> and simple operations like equality comparisons no longer work between >> `bytes` and `unicode/str`, making it difficult to work with numpy string >> arrays that yield `bytes` scalars. >> > > It seems the greatest challenge is interacting with binary data from other > programs and libraries. If we were living entirely in our own data world, > Unicode strings in object arrays would generally be pretty satisfactory. So > let's try to get what is needed to read and write other people's formats. > > I'll note that this is numpy, so variable-width fields (e.g. CSV) don't > map directly to numpy arrays; we can store it however we want, as > conversion is necessary anyway. > > Clearly there is a need for fixed-storage-size zero-padded UTF-8; two > other packages are waiting specifically for it. But specifying this > requires two pieces of information: What is the encoding? and How is the > length specified? I know they're not numpy-compatible, but FITS header > values are space-padded; does that occur elsewhere? Are there other ways > existing data specifies string length within a fixed-size field? There are > some cryptographic length-specification tricks - ANSI X.293, ISO 10126, > PKCS7, etc. - but they are probably too specialized to need? We should make > sure we can support all the ways that actually occur. > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. For byte strings, it looks like we need a parameterized type. This is for two uses, display and conversion to (Python) unicode. 
One could handle the display and conversion using view and astype methods. For instance, we already have In [1]: a = array([1,2,3], uint8) + 0x30 In [2]: a.view('S1') Out[2]: array(['1', '2', '3'], dtype='|S1') In [3]: a.view('S1').astype('U') Out[3]: array([u'1', u'2', u'3'], dtype='<U1') From wieser.eric+numpy at gmail.com Tue Apr 25 14:46:36 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Tue, 25 Apr 2017 18:46:36 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer allows str(b'123') to do the right thing. Specifically, it seems like astype should always be forbidden to go between unicode and byte arrays - so that would need to be written as: In [1]: a = array([1,2,3], uint8) + 0x30 In [2]: a.view('S1') Out[2]: array(['1', '2', '3'], dtype='|S1') In [3]: a.view('U[ascii]') Out[3]: array([u'1', u'2', u'3'], dtype=' wrote: On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald > wrote: > >> >> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern >> wrote: >> >>> * HDF5 supports fixed-length and variable-length string arrays encoded >>> in ASCII and UTF-8. In all cases, these strings are NULL-terminated >>> (despite the documentation claiming that there are more options). In >>> practice, the ASCII strings permit high-bit characters, but the encoding is >>> unspecified. Memory-mapping is rare (but apparently possible). The two >>> major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to >>> support that HDF5 option. Compression is supported for fixed-length string >>> arrays but not variable-length string arrays. >>> >>> * FITS supports fixed-length string arrays that are NULL-padded.
The >>> strings do not have a formal encoding, but in practice, they are typically >>> mostly ASCII characters with the occasional high-bit character from an >>> unspecific encoding. Memory-mapping is a common practice. These arrays can >>> be quite large even if each scalar is reasonably small. >>> >>> * pandas uses object arrays for flexible in-memory handling of string >>> columns. Lengths are not fixed, and None is used as a marker for missing >>> data. String columns must be written to and read from a variety of formats, >>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work >>> with `unicode/str` objects instead of `bytes`. >>> >>> * There are a number of sometimes-poorly-documented, >>> often-poorly-adhered-to, aging file format "standards" that include string >>> arrays but do not specify encodings, or such specification is ignored in >>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries >>> difficult to perform. >>> >>> * In Python 3 environments, `unicode/str` objects are rather more >>> common, and simple operations like equality comparisons no longer work >>> between `bytes` and `unicode/str`, making it difficult to work with numpy >>> string arrays that yield `bytes` scalars. >>> >> >> It seems the greatest challenge is interacting with binary data from >> other programs and libraries. If we were living entirely in our own data >> world, Unicode strings in object arrays would generally be pretty >> satisfactory. So let's try to get what is needed to read and write other >> people's formats. >> >> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't >> map directly to numpy arrays; we can store it however we want, as >> conversion is necessary anyway. >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two >> other packages are waiting specifically for it. But specifying this >> requires two pieces of information: What is the encoding? 
and How is the >> length specified? I know they're not numpy-compatible, but FITS header >> values are space-padded; does that occur elsewhere? Are there other ways >> existing data specifies string length within a fixed-size field? There are >> some cryptographic length-specification tricks - ANSI X.293, ISO 10126, >> PKCS7, etc. - but they are probably too specialized to need? We should make >> sure we can support all the ways that actually occur. >> > > Agree with the UTF-8 fixed byte length strings, although I would tend > towards null terminated. > > For byte strings, it looks like we need a parameterized type. This is for > two uses, display and conversion to (Python) unicode. One could handle the > display and conversion using view and astype methods. For instance, we > already have > > In [1]: a = array([1,2,3], uint8) + 0x30 > > In [2]: a.view('S1') > Out[2]: > array(['1', '2', '3'], > dtype='|S1') > > In [3]: a.view('S1').astype('U') > Out[3]: > array([u'1', u'2', u'3'], > dtype=' > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Apr 25 14:52:08 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 11:52:08 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.faceted at gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? 
I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. > > > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
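Robert's two definitions can be made concrete with a short sketch. This is plain Python for illustration only; the function names are mine, not a proposed API:

```python
def null_padded(s: bytes, width: int) -> bytes:
    # NULL-padded: all `width` bytes may carry data; any leftover
    # space is filled with NUL bytes.
    if len(s) > width:
        raise ValueError("string does not fit in field")
    return s.ljust(width, b"\x00")


def null_terminated(s: bytes, width: int) -> bytes:
    # NULL-terminated: one byte is always reserved for the terminating
    # NUL, so at most width - 1 bytes may carry data.
    if len(s) > width - 1:
        raise ValueError("string does not fit in field")
    return s.ljust(width, b"\x00")
```

For a width-4 field, null_padded accepts b'abcd' while null_terminated rejects it; both store b'ab' as b'ab\x00\x00'.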
URL: From njs at pobox.com Tue Apr 25 15:29:22 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 12:29:22 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 25, 2017 11:53 AM, "Robert Kern" wrote: On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.faceted at gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. > > > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. 
But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. And to save anyone else having to check, numpy's current NUL-padded dtypes only strip trailing NULs, so they can round-trip strings that contain NULs, just not strings where NUL is the last character. So the set of strings representable by str/bytes is a strict superset of the set of strings representable by numpy U/S dtypes, which in turn is a strict superset of the set of strings representable by a hypothetical NUL-terminated dtype. (Of course this doesn't matter for most practical purposes, because people rarely make strings with embedded NULs.) -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Apr 25 15:30:27 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 13:30:27 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > > > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < > peridot.faceted at gmail.com> wrote: > > >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two > other packages are waiting specifically for it. But specifying this > requires two pieces of information: What is the encoding? and How is the > length specified? I know they're not numpy-compatible, but FITS header > values are space-padded; does that occur elsewhere? Are there other ways > existing data specifies string length within a fixed-size field? There are > some cryptographic length-specification tricks - ANSI X.293, ISO 10126, > PKCS7, etc. 
- but they are probably too specialized to need? We should make > sure we can support all the ways that actually occur. > > > > > > Agree with the UTF-8 fixed byte length strings, although I would tend > towards null terminated. > > Just to clarify some terminology (because it wasn't originally clear to me > until I looked it up in reference to HDF5): > > * "NULL-padded" implies that, for a fixed width of N, there can be up to N > non-NULL bytes. Any extra space left over is padded with NULLs, but no > space needs to be reserved for NULLs. > > * "NULL-terminated" implies that, for a fixed width of N, there can be up > to N-1 non-NULL bytes. There must always be space reserved for the > terminating NULL. > > I'm not really sure if "NULL-padded" also specifies the behavior for > embedded NULLs. It's certainly possible to deal with them: just strip > trailing NULLs and leave any embedded ones alone. But I'm also sure that > there are some implementations somewhere that interpret the requirement as > "stop at the first NULL or the end of the fixed width, whichever comes > first", effectively being NULL-terminated just not requiring the reserved > space. > Thanks for the clarification. NULL-padded is what I meant. I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
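Nathaniel's description of the trailing-NUL behaviour is easy to check against stock numpy's existing 'S' dtype:

```python
import numpy as np

# Embedded NULs survive a round trip through a fixed-width 'S' dtype...
a = np.array([b"ab\x00c"], dtype="S5")
assert a[0] == b"ab\x00c"

# ...but a trailing NUL is indistinguishable from padding and is stripped.
b = np.array([b"abc\x00"], dtype="S5")
assert b[0] == b"abc"
```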
URL: From robert.kern at gmail.com Tue Apr 25 15:36:27 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 12:36:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris at gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.faceted at gmail.com> wrote: >> >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. >> > >> > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. >> >> Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): >> >> * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. >> >> * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. >> >> I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. 
It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. > > Thanks for the clarification. NULL-padded is what I meant. Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is NULL-terminated. > I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed. I'm not sure. Some of these fixed-width string arrays are embedded inside structured arrays with other dtypes. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Apr 25 15:37:16 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 12:37:16 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 25, 2017 9:35 AM, "Chris Barker" wrote: - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this. Eh... First, on Windows and MacOS, filenames are natively Unicode. So you don't care about preserving the bytes, only the characters. It's only Linux and the other traditional unixes where filenames are natively bytestrings. 
And then from in Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. I'm not seeing how latin1 really helps anything here -- best case you still have to do something like the wsgi "encoding dance" before you could use the filenames. IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-). -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Apr 25 15:38:19 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 13:38:19 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern > wrote: > >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < >> peridot.faceted at gmail.com> wrote: >> >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two >> other packages are waiting specifically for it. But specifying this >> requires two pieces of information: What is the encoding? and How is the >> length specified? I know they're not numpy-compatible, but FITS header >> values are space-padded; does that occur elsewhere? Are there other ways >> existing data specifies string length within a fixed-size field? There are >> some cryptographic length-specification tricks - ANSI X.293, ISO 10126, >> PKCS7, etc. - but they are probably too specialized to need? We should make >> sure we can support all the ways that actually occur. 
>> > >> > >> > Agree with the UTF-8 fixed byte length strings, although I would tend >> towards null terminated. >> >> Just to clarify some terminology (because it wasn't originally clear to >> me until I looked it up in reference to HDF5): >> >> * "NULL-padded" implies that, for a fixed width of N, there can be up to >> N non-NULL bytes. Any extra space left over is padded with NULLs, but no >> space needs to be reserved for NULLs. >> >> * "NULL-terminated" implies that, for a fixed width of N, there can be up >> to N-1 non-NULL bytes. There must always be space reserved for the >> terminating NULL. >> >> I'm not really sure if "NULL-padded" also specifies the behavior for >> embedded NULLs. It's certainly possible to deal with them: just strip >> trailing NULLs and leave any embedded ones alone. But I'm also sure that >> there are some implementations somewhere that interpret the requirement as >> "stop at the first NULL or the end of the fixed width, whichever comes >> first", effectively being NULL-terminated just not requiring the reserved >> space. >> > > Thanks for the clarification. NULL-padded is what I meant. > > I'm wondering how much of the desired functionality we could get by simply > subclassing ndarray in python. I think we mostly want to be able to view > byte strings and convert to unicode if needed. > > And I think the really tricky part is sorting and rich comparison. Unfortunately, the comparison function is currently located in the c structure. I suppose we could define a c wrapper function to go in the slot. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
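Chuck's two points (a thin Python subclass over byte storage, and the sorting/comparison slot) can be combined in a toy sketch. Everything below is hypothetical illustration, not a proposal; the interesting property is that for latin-1, bytewise comparison of the underlying 'S' dtype already matches code-point order, so sorting needs no custom comparison function:

```python
import numpy as np


class Latin1Array(np.ndarray):
    """Toy subclass: latin-1 bytes in an 'S' dtype, str handed back on access."""

    def __new__(cls, strings, width):
        data = [s.encode("latin-1") for s in strings]
        return np.array(data, dtype="S%d" % width).view(cls)

    def __getitem__(self, index):
        out = super().__getitem__(index)
        if isinstance(out, bytes):  # scalar access: decode for the user
            return out.decode("latin-1")
        return out
```

np.sort works unchanged because the bytewise 'S' comparison coincides with latin-1 code-point order: sorting ['b', 'a', 'é'] yields 'a', 'b', 'é'.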
URL: From njs at pobox.com Tue Apr 25 15:52:06 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 12:52:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 25, 2017 10:13 AM, "Anne Archibald" wrote: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to nail down first -- as Robert has made > clear. > I would make my use-cases more user-specific: 1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. -> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions are going to be a memory problem. It's possible to do much better than this when defining a specialized variable-width string dtype. E.g. make the itemsize 8 bytes (like an object array, assuming a 64 bit system), but then for strings that can be encoded in 7 bytes or less of utf8 store them directly in the array; else store a pointer to a raw utf8 string on the heap. (Possibly with a reference count - there are some interesting tradeoffs there. I suspect 1-byte reference counts might be the way to go; if a logical copy would make it overflow then make an actual copy instead.) Anything involving the heap is going to have some overhead, but we don't need full fledged Python objects and once we give up mmap compatibility then there's a lot of room to tune. -n -------------- next part -------------- An HTML attachment was scrubbed... 
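Nathaniel's small-string trick can be sketched numerically. The 8-byte itemsize and 7-byte inline budget are his example; the function and names are mine, purely for illustration:

```python
ITEMSIZE = 8  # matches the pointer size of an object array on a 64-bit system


def pack(s):
    """Decide where a string's UTF-8 bytes would live under this scheme."""
    raw = s.encode("utf-8")
    if len(raw) <= ITEMSIZE - 1:  # 7 payload bytes + 1 tag/length byte
        return ("inline", raw)    # stored directly in the array element
    return ("heap", raw)          # real code would store a pointer (+ refcount)
```

'numpy' is 5 UTF-8 bytes and stays inline; most short Western text fits, while longer or heavily multibyte strings fall back to the heap.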
URL: From chris.barker at noaa.gov Tue Apr 25 18:47:46 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 25 Apr 2017 15:47:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: <1229716955908306730@unknownmsgid> A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Actually, *I* am not trying to do anything here -- I'm the one that said computers are so big and fast now that we shouldn't whine about 4 bytes for a character....but this whole conversation started with that request...and I have sympathy .. no one likes to waste memory. After all, numpy support small numeric dtypes, too. Do you need your in-memory format for this data to be compatible with anything in particular? Not for this requirement -- binary interchange is another requirement. If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number so that you need a more compact, fixed-stride memory layout? That's the whole point, yes. Object arrays would be a good solution to the full Unicode problem, not the "why am I wasting so much space when all my data are ascii ? Presumably you're getting byte strings (with unknown encoding. No -- thus is for creating and using mostly ascii string data with python and numpy. Unknown encoding bytes belong in byte arrays -- they are not text. I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way to many files like that. Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. 
All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays; Or UCS-4. I think object arrays would be more problematic for npz storage, and raw "tostring" dumping. (And pickle?) not sure how important that is. And it would be good to have something that plays well with recarrays anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type. Exactly. -CHB -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 18:50:05 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 25 Apr 2017 15:50:05 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: <5825136600369965739@unknownmsgid> Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before they can go into a data structure. What I was wondering was whether people have data lying around with fixed-width fields where the strings are space-padded, so that numpy needs to support that. I would say whether to strip space-padded strings should be the reader's problem, not numpy's -CHB -------------- next part -------------- An HTML attachment was scrubbed... 
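Chris's "reader's problem" position is already workable with existing tools; a FITS-style space-padded column can be stripped at the reading layer (illustrative values):

```python
import numpy as np

# Fixed-width, space-padded fields as they might come off disk.
raw = np.array([b"NGC 1234  ", b"M31       "], dtype="S10")

# Strip the padding in the reader, not in the dtype itself.
clean = np.char.rstrip(raw, b" ")
```

Only trailing spaces are removed, so the interior space in b'NGC 1234' survives.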
URL: From chris.barker at noaa.gov Tue Apr 25 19:11:27 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 25 Apr 2017 16:11:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: <-2179002348619298640@unknownmsgid> > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > Eh... First, on Windows and MacOS, filenames are natively Unicode. Yeah, though once they are stored I. A text file -- who the heck knows? That may be simply unsolvable. > s. And then from in Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. > IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-). I thought the Python file (and Path) APIs all required (Unicode) strings? That was the whole complaint! And no, bytestrings are not perfectly friendly in py3. This got really complicated and sidetracked, but All I'm suggesting is that if we have a 1byte per char string type, with a fixed encoding, that that encoding be Latin-1, rather than ASCII. That's it, really. Having a settable encoding would work fine, too. -CHB From robert.kern at gmail.com Tue Apr 25 19:50:05 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 16:50:05 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <1229716955908306730@unknownmsgid> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < chris.barker at noaa.gov> wrote: >> Presumably you're getting byte strings (with unknown encoding. > > No -- thus is for creating and using mostly ascii string data with python and numpy. 
> > Unknown encoding bytes belong in byte arrays -- they are not text. You are welcome to try to convince Thomas of that. That is the status quo for him, but he is finding that difficult to work with. > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way to many files like that. That sloppiness that you mention is precisely the "unknown encoding" problem. Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-) > Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-) There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
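The trade-off Robert describes is visible in plain Python: latin-1 maps every byte to a code point, so it decodes any input (possibly to the wrong characters), while UTF-8 refuses invalid bytes rather than guessing:

```python
data = b"caf\xe9"  # 'caf\xe9' ('café') if the file really was latin-1

# latin-1 decoding never fails...
assert data.decode("latin-1") == "caf\xe9"

# ...whereas UTF-8 rejects the stray high byte outright.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```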
URL: From njs at pobox.com Tue Apr 25 20:41:22 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 17:41:22 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <-2179002348619298640@unknownmsgid> References: <8741041756854148453@unknownmsgid> <-2179002348619298640@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal wrote: >> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > >> Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored I. A text file -- who the heck > knows? That may be simply unsolvable. >> s. And then from in Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. > > >> IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-). > > I thought the Python file (and Path) APIs all required (Unicode) > strings? That was the whole complaint! No, the path APIs all accept bytestrings (and ones that return pathnames like listdir return bytestrings if given bytestrings). Or at least they're supposed to. The really urgent need for surrogateescape was things like sys.argv and os.environ where arbitrary bytes might come in (on some systems) but the API is restricted to strs. > And no, bytestrings are not perfectly friendly in py3. I'm not saying you should use them everywhere or that they remove the need for an ergonomic text dtype, but when you actually want to work with bytes they're pretty good (esp. in modern py3). -n -- Nathaniel J. 
Smith -- https://vorpus.org From charlesr.harris at gmail.com Tue Apr 25 21:27:57 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 19:27:57 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < > chris.barker at noaa.gov> wrote: > > >> Presumably you're getting byte strings (with unknown encoding. > > > > No -- thus is for creating and using mostly ascii string data with > python and numpy. > > > > Unknown encoding bytes belong in byte arrays -- they are not text. > > You are welcome to try to convince Thomas of that. That is the status quo > for him, but he is finding that difficult to work with. > > > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, > with a few extra characters" data. With all the sloppiness over the years, > there are way to many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" > problem. Your previous advocacy has also touched on using latin-1 to decode > existing files with unknown encodings as well. If you want to advocate for > using latin-1 only for the creation of new data, maybe stop talking about > existing files? :-) > > > Note: the primary use-case I have in mind is working with ascii text in > numpy arrays efficiently-- folks have called for that. All I'm saying is > use Latin-1 instead of ascii -- that buys you some useful extra characters. > > For that use case, the alternative in play isn't ASCII, it's UTF-8, which > buys you a whole bunch of useful extra characters. ;-) > > There are several use cases being brought forth here. Some involve file > reading, some involve file writing, and some involve in-memory > manipulation. 
Whatever change we make is going to impinge somehow on all of > the use cases. If all we do is add a latin-1 dtype for people to use to > create new in-memory data, then someone is going to use it to read existing > data in unknown or ambiguous encodings. > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. Note that for terminal display we will want something supported by the system, which is another problem altogether. Let me break the problem down into four categories 1. Storage -- hdf5, .npy, fits, etc. 2. Display -- ? 3. Modification -- editing 4. Parsing -- fits, etc. There is probably no one solution that is optimal for all of those. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Tue Apr 25 21:55:52 2017 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 25 Apr 2017 21:55:52 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >> wrote: >> >> >> Presumably you're getting byte strings (with unknown encoding. >> > >> > No -- thus is for creating and using mostly ascii string data with >> > python and numpy. >> > >> > Unknown encoding bytes belong in byte arrays -- they are not text. >> >> You are welcome to try to convince Thomas of that. That is the status quo >> for him, but he is finding that difficult to work with. 
>> >> > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, >> > with a few extra characters" data. With all the sloppiness over the years, >> > there are way to many files like that. >> >> That sloppiness that you mention is precisely the "unknown encoding" >> problem. Your previous advocacy has also touched on using latin-1 to decode >> existing files with unknown encodings as well. If you want to advocate for >> using latin-1 only for the creation of new data, maybe stop talking about >> existing files? :-) >> >> > Note: the primary use-case I have in mind is working with ascii text in >> > numpy arrays efficiently-- folks have called for that. All I'm saying is use >> > Latin-1 instead of ascii -- that buys you some useful extra characters. >> >> For that use case, the alternative in play isn't ASCII, it's UTF-8, which >> buys you a whole bunch of useful extra characters. ;-) >> >> There are several use cases being brought forth here. Some involve file >> reading, some involve file writing, and some involve in-memory manipulation. >> Whatever change we make is going to impinge somehow on all of the use cases. >> If all we do is add a latin-1 dtype for people to use to create new >> in-memory data, then someone is going to use it to read existing data in >> unknown or ambiguous encodings. > > > > The maximum length of an UTF-8 character is 4 bytes, so we could use that to > size arrays by character length. The advantage over UTF-32 is that it is > easily compressible, probably by a factor of 4 in many cases. That doesn't > solve the in memory problem, but does have some advantages on disk as well > as making for easy display. We could compress it ourselves after encoding by > truncation. > > Note that for terminal display we will want something supported by the > system, which is another problem altogether. Let me break the problem down > into four categories > > Storage -- hdf5, .npy, fits, etc. > Display -- ? 
> Modification -- editing > Parsing -- fits, etc. > > There is probably no one solution that is optimal for all of those. > > Chuck > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > quoting Julian ''' I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. ''' I don't quite understand why this discussion goes in a direction of an either one XOR the other dtype. I thought the parameterized 1-byte encoding that Julian mentioned initially sounds useful to me. (I'm not sure I will use it much, but I also don't use float16 ) Josef From aldcroft at head.cfa.harvard.edu Tue Apr 25 22:02:38 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Tue, 25 Apr 2017 22:02:38 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <-2179002348619298640@unknownmsgid> References: <8741041756854148453@unknownmsgid> <-2179002348619298640@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal < chris.barker at noaa.gov> wrote: > > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > > > Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored I. A text file -- who the heck > knows? That may be simply unsolvable. > > s. 
And then from in Python, if you want to actually work with those > filenames you need to either have a bytestring type or else a Unicode type > that uses surrogateescape to represent the non-ascii characters. > > > > IMO if you have filenames that are arbitrary bytestrings and you need to > represent this properly, you should just use bytestrings -- really, they're > perfectly friendly :-). > > I thought the Python file (and Path) APIs all required (Unicode) > strings? That was the whole complaint! > > And no, bytestrings are not perfectly friendly in py3. > > This got really complicated and sidetracked, but All I'm suggesting is > that if we have a 1byte per char string type, with a fixed encoding, > that that encoding be Latin-1, rather than ASCII. > > That's it, really. > Fully agreed. > > Having a settable encoding would work fine, too. > Yup. At a simple level, I just want the things that currently work just fine in Py2 to start working in Py3. That includes being able to read / manipulate / compute and write back to legacy binary FITS and HDF5 files that include ASCII-ish text data (not strictly ASCII). Memory mapping such files should be supportable. Swapping type from bytes to a 1-byte char str should be possible without altering data in memory. BTW, I am saying "I want", but this functionality would definitely be welcome in astropy. I wrote a unicode sandwich workaround for the astropy Table class (https://github.com/astropy/astropy/pull/5700) which should be in the next release. It would be way better to have this at a level lower in numpy. - Tom > > -CHB > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
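The zero-copy reinterpretation Tom asks for ("swapping type from bytes to a 1-byte char str... without altering data in memory") is the same mechanism `.view` already provides between equal-width dtypes today. A quick illustrative sketch with current numpy:

```python
import numpy as np

b = np.array([b'abc', b'xyz'], dtype='S3')

# Reinterpret the same buffer as raw bytes -- no copy, no conversion.
# A hypothetical 1-byte text dtype could be viewed the same way.
u = b.view(np.uint8).reshape(2, 3)
print(u)   # [[ 97  98  99]
           #  [120 121 122]]
print(np.shares_memory(b, u))   # True
```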
URL: From robert.kern at gmail.com Wed Apr 26 00:20:46 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 21:20:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris wrote: > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Apr 26 01:19:22 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 26 Apr 2017 05:19:22 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > > > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many cases. That > doesn't solve the in memory problem, but does have some advantages on disk > as well as making for easy display. We could compress it ourselves after > encoding by truncation. > > The major use case that we have for a UTF-8 array is HDF5, and it > specifies the width in bytes, not Unicode characters. 
> It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, it needs to be one of them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Wed Apr 26 05:15:36 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 26 Apr 2017 11:15:36 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> On 26.04.2017 03:55, josef.pktd at gmail.com wrote: > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > wrote: >> >> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >>> wrote: >>> >>>>> Presumably you're getting byte strings (with unknown encoding. >>>> >>>> No -- thus is for creating and using mostly ascii string data with >>>> python and numpy. >>>> >>>> Unknown encoding bytes belong in byte arrays -- they are not text. >>> >>> You are welcome to try to convince Thomas of that. That is the status quo >>> for him, but he is finding that difficult to work with. >>> >>>> I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, >>>> with a few extra characters" data. With all the sloppiness over the years, >>>> there are way to many files like that. >>> >>> That sloppiness that you mention is precisely the "unknown encoding" >>> problem. Your previous advocacy has also touched on using latin-1 to decode >>> existing files with unknown encodings as well. 
If you want to advocate for >>> using latin-1 only for the creation of new data, maybe stop talking about >>> existing files? :-) >>> >>>> Note: the primary use-case I have in mind is working with ascii text in >>>> numpy arrays efficiently-- folks have called for that. All I'm saying is use >>>> Latin-1 instead of ascii -- that buys you some useful extra characters. >>> >>> For that use case, the alternative in play isn't ASCII, it's UTF-8, which >>> buys you a whole bunch of useful extra characters. ;-) >>> >>> There are several use cases being brought forth here. Some involve file >>> reading, some involve file writing, and some involve in-memory manipulation. >>> Whatever change we make is going to impinge somehow on all of the use cases. >>> If all we do is add a latin-1 dtype for people to use to create new >>> in-memory data, then someone is going to use it to read existing data in >>> unknown or ambiguous encodings. >> >> >> >> The maximum length of an UTF-8 character is 4 bytes, so we could use that to >> size arrays by character length. The advantage over UTF-32 is that it is >> easily compressible, probably by a factor of 4 in many cases. That doesn't >> solve the in memory problem, but does have some advantages on disk as well >> as making for easy display. We could compress it ourselves after encoding by >> truncation. >> >> Note that for terminal display we will want something supported by the >> system, which is another problem altogether. Let me break the problem down >> into four categories >> >> Storage -- hdf5, .npy, fits, etc. >> Display -- ? >> Modification -- editing >> Parsing -- fits, etc. >> >> There is probably no one solution that is optimal for all of those. 
>> >> Chuck >> >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > > > quoting Julian > > ''' > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > This allows any codec (including truncated utf8) to be added easily (if > python supports it) and allows sidestepping the debate. > > My main concern is whether it should be a new dtype or modifying the > unicode dtype. Though the backward compatibility argument is strongly in > favour of adding a new dtype that makes the np.unicode type redundant. > ''' > > I don't quite understand why this discussion goes in a direction of an > either one XOR the other dtype. > > I thought the parameterized 1-byte encoding that Julian mentioned > initially sounds useful to me. > > (I'm not sure I will use it much, but I also don't use float16 ) > > Josef Indeed, Most of this discussion is irrelevant to numpy. Numpy only really deals with the in memory storage of strings. And in that it is limited to fixed length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting it out yourself before, you won't be able to afterwards. Numpy will offer a set of encodings, the user chooses which one is best for the use case and if the user screws it up, it is not numpy's problem. You currently only have a few ways to even construct string arrays: - array construction and loops - genfromtxt (which is again just a loop) - memory mapping which I seriously doubt anyone actually does for the S and U dtype Having a new dtype changes nothing here. 
You still need to create numpy arrays from python strings which are well defined and clean. If you put something in that doesn't encode you get an encoding error. No oddities like surrogate escapes are needed, numpy arrays are not interfaces to operating systems nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in don't use a text dtype (which includes the S dtype, use i1). Concerning variable sized strings, this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. Best you are going to get (or better already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it. What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its usecase is for the most part just for python3 porting and saving some memory in some ascii heavy cases, e.g. astronomy. It is not that significant anymore as porting to python3 has mostly already happened via the ugly byte workaround and memory saving is probably not as significant in the context of numpy which is already heavy on memory usage. My initial approach was to not add a new dtype but to make unicode parametrizable which would have meant almost no cluttering of numpys internals and keeping the api more or less consistent which would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new partially redundant dtype for this usecase may be a too large change to the api. Having two partially redundant string types may confuse users more than our current status quo of our single string type (U). 
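For reference, the memory gap Julian describes, and the encode-on-assignment error he mentions, are easy to see with the dtypes numpy already ships (illustrative snippet only, nothing from the proposal itself):

```python
import numpy as np

# The 'U' dtype stores UCS-4: four bytes per character.
u = np.array(['spam', 'eggs'], dtype='U4')
print(u.itemsize)        # 16 bytes per element

# The 'S' dtype stores one byte per character, but on Python 3
# its elements come back as bytes, not str.
s = np.array([b'spam', b'eggs'], dtype='S4')
print(s.itemsize)        # 4 bytes per element

# Text that does not encode is rejected at construction time.
try:
    np.array(['caf\xe9'], dtype='S4')
except UnicodeEncodeError:
    print('non-ascii text does not fit an S array')
```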
Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shoot > themselves in the foot with. > But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added later should we deem it a good idea. > cheers, > Julian
-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL:
From peridot.faceted at gmail.com Wed Apr 26 06:27:00 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Wed, 26 Apr 2017 10:27:00 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: > On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > >> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >> > The maximum length of an UTF-8 character is 4 bytes, so we could use >> that to size arrays by character length. The advantage over UTF-32 is that >> it is easily compressible, probably by a factor of 4 in many cases. That >> doesn't solve the in memory problem, but does have some advantages on disk >> as well as making for easy display. We could compress it ourselves after >> encoding by truncation. >> >> The major use case that we have for a UTF-8 array is HDF5, and it >> specifies the width in bytes, not Unicode characters. >> > > It's not just HDF5. Counting bytes is the Right Way to measure the size of > UTF-8 encoded text: > http://utf8everywhere.org/#myths > > I also firmly believe (though clearly this is not universally agreed upon) > that UTF-8 is the Right Way to encode strings for *non-legacy* > applications. So if we're adding any new string encodings, it needs to be > one of them.
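The one-to-four-byte spread at issue here (characters versus bytes in UTF-8) is easy to verify directly in Python:

```python
# UTF-8 is variable-width: one to four bytes per code point, so the
# byte length of encoded text differs from its character count.
for ch in ['A', '\xe9', '\u20ac', '\U0001f600']:   # A, e-acute, euro, emoji
    print(repr(ch), len(ch), 'char ->', len(ch.encode('utf-8')), 'byte(s)')
```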
> It seems to me that most of the requirements people have expressed in this thread would be satisfied by: (1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.) (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case. (3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense. There seems to be considerable debate about what the "default" string type should be, but since users must specify a length anyway, might as well force them to specify an encoding and thus dodge the debate about the right default. The other question - which I realize is how the thread started - is what to do about backward compatibility. I'm not writing the code, so my opinion doesn't matter much, but I think we're stuck maintaining what we have now - ASCII and UCS4 strings - for a while yet. But it can be deprecated, or they can be simply reimplemented as shorthand names for ASCII- or UCS4-encoded strings in the bytes-with-encoding dtype. Anne -------------- next part -------------- An HTML attachment was scrubbed... 
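Anne's point (2) mentions "a function to truncate encoded data without mangling the encoding". A minimal sketch of what such a helper could look like for UTF-8 — the name `truncate_utf8` is hypothetical, not an existing numpy or stdlib API, and it assumes valid UTF-8 input:

```python
def truncate_utf8(data: bytes, nbytes: int) -> bytes:
    """Truncate valid UTF-8 to at most nbytes without leaving a
    dangling partial code point at the end."""
    # errors='ignore' silently drops the incomplete trailing
    # sequence (if any) that the byte-slice may have created.
    return data[:nbytes].decode('utf-8', 'ignore').encode('utf-8')

s = 'abc\xe9'.encode('utf-8')   # b'abc\xc3\xa9', 5 bytes
print(truncate_utf8(s, 4))      # b'abc' -- not the mangled b'abc\xc3'
print(truncate_utf8(s, 5))      # b'abc\xc3\xa9', untouched
```

"Keep graphemes" truncation, as Eric notes later in the thread, would be harder: it needs Unicode segmentation, not just code-point boundaries.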
URL: From charlesr.harris at gmail.com Wed Apr 26 11:19:13 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 26 Apr 2017 09:19:13 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > On 26.04.2017 03:55, josef.pktd at gmail.com wrote: > > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > > wrote: > >> > >> > >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern > wrote: > >>> > >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal > >>> wrote: > >>> > >>>>> Presumably you're getting byte strings (with unknown encoding. > >>>> > >>>> No -- thus is for creating and using mostly ascii string data with > >>>> python and numpy. > >>>> > >>>> Unknown encoding bytes belong in byte arrays -- they are not text. > >>> > >>> You are welcome to try to convince Thomas of that. That is the status > quo > >>> for him, but he is finding that difficult to work with. > >>> > >>>> I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, > >>>> with a few extra characters" data. With all the sloppiness over the > years, > >>>> there are way to many files like that. > >>> > >>> That sloppiness that you mention is precisely the "unknown encoding" > >>> problem. Your previous advocacy has also touched on using latin-1 to > decode > >>> existing files with unknown encodings as well. If you want to advocate > for > >>> using latin-1 only for the creation of new data, maybe stop talking > about > >>> existing files? :-) > >>> > >>>> Note: the primary use-case I have in mind is working with ascii text > in > >>>> numpy arrays efficiently-- folks have called for that. 
All I'm saying > is use > >>>> Latin-1 instead of ascii -- that buys you some useful extra > characters. > >>> > >>> For that use case, the alternative in play isn't ASCII, it's UTF-8, > which > >>> buys you a whole bunch of useful extra characters. ;-) > >>> > >>> There are several use cases being brought forth here. Some involve file > >>> reading, some involve file writing, and some involve in-memory > manipulation. > >>> Whatever change we make is going to impinge somehow on all of the use > cases. > >>> If all we do is add a latin-1 dtype for people to use to create new > >>> in-memory data, then someone is going to use it to read existing data > in > >>> unknown or ambiguous encodings. > >> > >> > >> > >> The maximum length of an UTF-8 character is 4 bytes, so we could use > that to > >> size arrays by character length. The advantage over UTF-32 is that it is > >> easily compressible, probably by a factor of 4 in many cases. That > doesn't > >> solve the in memory problem, but does have some advantages on disk as > well > >> as making for easy display. We could compress it ourselves after > encoding by > >> truncation. > >> > >> Note that for terminal display we will want something supported by the > >> system, which is another problem altogether. Let me break the problem > down > >> into four categories > >> > >> Storage -- hdf5, .npy, fits, etc. > >> Display -- ? > >> Modification -- editing > >> Parsing -- fits, etc. > >> > >> There is probably no one solution that is optimal for all of those. > >> > >> Chuck > >> > >> > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at python.org > >> https://mail.python.org/mailman/listinfo/numpy-discussion > >> > > > > > > quoting Julian > > > > ''' > > I probably have formulated my goal with the proposal a bit better, I am > > not very interested in a repetition of which encoding to use debate. 
> > In the end what will be done allows any encoding via a dtype with > > metadata like datetime. > > This allows any codec (including truncated utf8) to be added easily (if > > python supports it) and allows sidestepping the debate. > > > > My main concern is whether it should be a new dtype or modifying the > > unicode dtype. Though the backward compatibility argument is strongly in > > favour of adding a new dtype that makes the np.unicode type redundant. > > ''' > > > > I don't quite understand why this discussion goes in a direction of an > > either one XOR the other dtype. > > > > I thought the parameterized 1-byte encoding that Julian mentioned > > initially sounds useful to me. > > > > (I'm not sure I will use it much, but I also don't use float16 ) > > > > Josef > > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you get your messy strings into numpy arrays is not very relevant to > the discussion of a smaller representation of strings. > You couldn't get messy strings into numpy without first sorting it out > yourself before, you won't be able to afterwards. > Numpy will offer a set of encodings, the user chooses which one is best > for the use case and if the user screws it up, it is not numpy's problem. > > You currently only have a few ways to even construct string arrays: > - array construction and loops > - genfromtxt (which is again just a loop) > - memory mapping which I seriously doubt anyone actually does for the S > and U dtype > > Having a new dtype changes nothing here. You still need to create numpy > arrays from python strings which are well defined and clean. > If you put something in that doesn't encode you get an encoding error. 
> No oddities like surrogate escapes are needed, numpy arrays are not > interfaces to operating systems nor does numpy need to _add_ support for > historical oddities beyond what it already has. > If you want to represent bytes exactly as they came in don't use a text > dtype (which includes the S dtype, use i1). > > Concerning variable sized strings, this is simply not going to happen. > Nobody is going to rewrite numpy to support it, especially not just for > something as unimportant as strings. > Best you are going to get (or better already have) is object arrays. It > makes no sense to discuss it unless someone comes up with an actual > proposal and the willingness to code it. > > > What is a relevant discussion is whether we really need a more compact > but limited representation of text than 4-byte utf32 at all. > Its usecase is for the most part just for python3 porting and saving > some memory in some ascii heavy cases, e.g. astronomy. > It is not that significant anymore as porting to python3 has mostly > already happened via the ugly byte workaround and memory saving is > probably not as significant in the context of numpy which is already > heavy on memory usage. > > My initial approach was to not add a new dtype but to make unicode > parametrizable which would have meant almost no cluttering of numpys > internals and keeping the api more or less consistent which would make > this a relatively simple addition of minor functionality for people that > want it. > But adding a completely new partially redundant dtype for this usecase > may be a too large change to the api. Having two partially redundant > string types may confuse users more than our current status quo of our > single string type (U). > > Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shot > themselves in the foot with. 
> But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added latter should we deem it a good idea. > I think we can implement viewers for strings as ndarray subclasses. Then one could do `my_string_array.view(latin_1)`, and so on. Essentially that just changes the default encoding of the 'S' array. That could also work for uint8 arrays if needed. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Wed Apr 26 11:39:46 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Wed, 26 Apr 2017 16:39:46 +0100 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: > I think we can implement viewers for strings as ndarray subclasses. Then one > could > do `my_string_array.view(latin_1)`, and so on. Essentially that just > changes the default > encoding of the 'S' array. That could also work for uint8 arrays if needed. > > Chuck To handle structured data-types containing encoded strings, we'd also need to subclass `np.void`. Things would get messy when a structured dtype contains two strings in different encodings (or more likely, one bytestring and one textstring) - we'd need some way to specify which fields are in which encoding, and using subclasses means that this can't be contained within the dtype information. So I think there's a strong argument for solving this with`dtype`s rather than subclasses. This really doesn't seem hard though. 
Something like (C-but-as-python):

    def ENCSTRING_getitem(ptr, arr):  # the PyArrFuncs slot
        encoded = STRING_getitem(ptr, arr)
        return encoded.decode(arr.dtype.encoding)

    def ENCSTRING_setitem(val, ptr, arr):  # the PyArrFuncs slot
        val = val.encode(arr.dtype.encoding)
        # todo: handle "safe" truncation, where safe might mean keep
        # codepoints, keep graphemes, or never allow
        STRING_setitem(val, ptr, arr)

We'd probably need to be careful to do a decode/encode dance when copying from one encoding to another, but we [already have bugs](https://github.com/numpy/numpy/issues/3258) in those cases anyway. Is it reasonable that the user of such an array would want to work with plain `builtin.unicode` objects, rather than some special numpy scalar type? Eric
From chris.barker at noaa.gov Wed Apr 26 12:28:48 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Wed, 26 Apr 2017 09:28:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: <-5378706506035339722@unknownmsgid> > > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" problem. Exactly -- but from a practicality-beats-purity perspective, there is a difference between "I have no idea whatsoever" and "I know it is mostly ascii, and European, but there are some extra characters in there". Latin-1 has proven very useful for that case. I suppose in most cases ascii with errors='replace' would be a good choice, but I'd still rather not throw out potentially useful information. > Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well.
If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-) Yeah, I've been very unfocused in this discussion -- sorry about that. > > Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. > > For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-) UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy to access, and particularly defaults, numpy string dtypes should match it. It's become clear in this discussion that there is s strong desire to support a numpy dtype that stores text in particular binary formats (I.e. Encodings). Rather than choose one or two, we might as well support all encodings supported by python. In that case, we'll have utf-8 for those that know they want that, and we'll have latin-1 for those that incorrectly think they want that :-) So what remains is to decide is implementation, syntax, and defaults. Let's keep in mind that most of us on this list, and in this discussion, are the folks that write interface code and the like. But most numpy users are not as tuned in to the internals. So defaults should be set to best support the more "naive" user. > . If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings. If we add every encoding known to man someone is going to use Latin-1 to read unknown encodings. Indeed, as we've all pointed out, there is no correct encoding with which to read unknown encodings. 
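A small illustration of the trade-off Chris describes, using plain Python codecs on a "mostly ascii" byte string (latin-1 maps every byte so it never fails, ascii-with-replace discards information, and strict UTF-8 simply rejects the data):

```python
raw = b'caf\xe9 r\xe9sum\xe9'          # latin-1 bytes, mostly ascii

print(raw.decode('latin-1'))           # 'café résumé' -- never fails
print(raw.decode('ascii', 'replace'))  # replacement chars, info lost
try:
    raw.decode('utf-8')                # strict utf-8 rejects it outright
except UnicodeDecodeError as e:
    print('utf-8:', e.reason)
```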
Frankly, if we have UTF-8 under the hood, I think people are even MORE likely to use it inappropriately-- it's quite scary how many people think UTF-8 == Unicode, and think all you need to do is "use utf-8", and you don't need to change any of the rest of your code. Oh, and once you've done that, you can use your existing ASCII-only tests and think you have a working application :-) -CHB From robert.kern at gmail.com Wed Apr 26 13:08:01 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 10:08:01 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you get your messy strings into numpy arrays is not very relevant to > the discussion of a smaller representation of strings. > You couldn't get messy strings into numpy without first sorting it out > yourself before, you won't be able to afterwards. > Numpy will offer a set of encodings, the user chooses which one is best > for the use case and if the user screws it up, it is not numpy's problem. > > You currently only have a few ways to even construct string arrays: > - array construction and loops > - genfromtxt (which is again just a loop) > - memory mapping which I seriously doubt anyone actually does for the S > and U dtype I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays. 
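As a concrete instance of the memory-mapping pattern at issue, the existing fixed-width S dtype can be mapped directly with `np.memmap`; a minimal sketch (the file name and record contents are invented for illustration):

```python
import os
import tempfile

import numpy as np

# Write a file of fixed-width 8-byte records, then map it read-only
# without loading it into memory up front.
path = os.path.join(tempfile.mkdtemp(), "names.bin")
np.array([b"alice", b"bob", b"carol"], dtype="S8").tofile(path)

names = np.memmap(path, dtype="S8", mode="r")
assert names.shape == (3,)
assert names[1] == b"bob"  # trailing NUL padding is stripped on access
```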
You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays. Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am. > Having a new dtype changes nothing here. You still need to create numpy > arrays from python strings which are well defined and clean. > If you put something in that doesn't encode you get an encoding error. > No oddities like surrogate escapes are needed, numpy arrays are not > interfaces to operating systems nor does numpy need to _add_ support for > historical oddities beyond what it already has. > If you want to represent bytes exactly as they came in don't use a text > dtype (which includes the S dtype, use i1). Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities. > Concerning variable sized strings, this is simply not going to happen. > Nobody is going to rewrite numpy to support it, especially not just for > something as unimportant as strings. > Best you are going to get (or better already have) is object arrays. It > makes no sense to discuss it unless someone comes up with an actual > proposal and the willingness to code it. No one has suggested such a thing. At most, we've talked about specializing object arrays. > What is a relevant discussion is whether we really need a more compact > but limited representation of text than 4-byte utf32 at all. > Its usecase is for the most part just for python3 porting and saving > some memory in some ascii heavy cases, e.g. astronomy. 
> It is not that significant anymore as porting to python3 has mostly > already happened via the ugly byte workaround and memory saving is > probably not as significant in the context of numpy which is already > heavy on memory usage. > > My initial approach was to not add a new dtype but to make unicode > parametrizable which would have meant almost no cluttering of numpys > internals and keeping the api more or less consistent which would make > this a relatively simple addition of minor functionality for people that > want it. > But adding a completely new partially redundant dtype for this usecase > may be a too large change to the api. Having two partially redundant > string types may confuse users more than our current status quo of our > single string type (U). > > Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shot > themselves in the foot with. > But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added latter should we deem it a good idea. What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date? If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtaylor.debian at googlemail.com Wed Apr 26 13:43:47 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 26 Apr 2017 19:43:47 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On 26.04.2017 19:08, Robert Kern wrote: > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > > wrote: > >> Indeed, >> Most of this discussion is irrelevant to numpy. >> Numpy only really deals with the in memory storage of strings. And in >> that it is limited to fixed length strings (in bytes/codepoints). >> How you get your messy strings into numpy arrays is not very relevant to >> the discussion of a smaller representation of strings. >> You couldn't get messy strings into numpy without first sorting it out >> yourself before, you won't be able to afterwards. >> Numpy will offer a set of encodings, the user chooses which one is best >> for the use case and if the user screws it up, it is not numpy's problem. >> >> You currently only have a few ways to even construct string arrays: >> - array construction and loops >> - genfromtxt (which is again just a loop) >> - memory mapping which I seriously doubt anyone actually does for the S >> and U dtype > > I fear that you decided that the discussion was irrelevant and thus did > not read it rather than reading it to decide that it was not relevant. > Because several of us have showed that, yes indeed, we do memory-map > string arrays. > > You can add to this list C APIs, like that of libhdf5, that need to > communicate (Unicode) string arrays. > > Look, I know I can be tedious, but *please* go back and read this > discussion. We have concrete use cases outlined. We can give you more > details if you need them. We all feel the pain of the rushed, inadequate > implementation of the U dtype. 
> But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.

I have read every mail and it has been a large waste of time; everything has been said already many times in the last few years. Even if you memory-map string arrays -- of which I have not seen a concrete use case in the mails beyond "would be nice to have" without any backing in actual code, though I may have missed it -- it is in any case still irrelevant. My proposal only _adds_ additional cases that can be mmapped. It does not prevent you from doing what you have been doing before.

> >> Having a new dtype changes nothing here. You still need to create numpy >> arrays from python strings which are well defined and clean. >> If you put something in that doesn't encode you get an encoding error. >> No oddities like surrogate escapes are needed, numpy arrays are not >> interfaces to operating systems nor does numpy need to _add_ support for >> historical oddities beyond what it already has. >> If you want to represent bytes exactly as they came in don't use a text >> dtype (which includes the S dtype, use i1). > > Thomas Aldcroft has demonstrated the problem with this approach. numpy > arrays are often interfaces to files that have tons of historical oddities.

This does not matter for numpy: the text dtype is well defined as bytes with a specific encoding and null padding. If you have an historical oddity that does not fit, do not use the text dtype but use a pure byte array instead.

> >> Concerning variable sized strings, this is simply not going to happen. >> Nobody is going to rewrite numpy to support it, especially not just for >> something as unimportant as strings. >> Best you are going to get (or better already have) is object arrays. It >> makes no sense to discuss it unless someone comes up with an actual >> proposal and the willingness to code it. > > No one has suggested such a thing. At most, we've talked about > specializing object arrays.
> >> What is a relevant discussion is whether we really need a more compact >> but limited representation of text than 4-byte utf32 at all. >> Its usecase is for the most part just for python3 porting and saving >> some memory in some ascii heavy cases, e.g. astronomy. >> It is not that significant anymore as porting to python3 has mostly >> already happened via the ugly byte workaround and memory saving is >> probably not as significant in the context of numpy which is already >> heavy on memory usage. >> >> My initial approach was to not add a new dtype but to make unicode >> parametrizable which would have meant almost no cluttering of numpys >> internals and keeping the api more or less consistent which would make >> this a relatively simple addition of minor functionality for people that >> want it. >> But adding a completely new partially redundant dtype for this usecase >> may be a too large change to the api. Having two partially redundant >> string types may confuse users more than our current status quo of our >> single string type (U). >> >> Discussing whether we want to support truncated utf8 has some merit as >> it is a decision whether to give the users an even larger gun to shot >> themselves in the foot with. >> But I'd like to focus first on the 1 byte type to add a symmetric API >> for python2 and python3. >> utf8 can always be added latter should we deem it a good idea. > > What is your current proposal? A string dtype parameterized with the > encoding (initially supporting the latin-1 that you desire and maybe > adding utf-8 later)? Or a latin-1-specific dtype such that we will have > to add a second utf-8 dtype at a later date? My proposal is a single new parameterizable dtype. Adding multiple dtypes for each encoding seems unnecessary to me given that numpy already supports parameterizable types. For example datetime is very similar, it is basically encoded integers. There are multiple encodings = units supported. 
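The datetime analogy can be checked against the existing API: datetime64 is a single dtype kind parameterized by a unit, not a separate dtype per unit, which is the shape a parameterized text dtype would presumably take with an encoding in the unit's place:

```python
import numpy as np

# datetime64 is one dtype kind parameterized by a time unit
ms = np.dtype("datetime64[ms]")
ns = np.dtype("datetime64[ns]")

assert ms.kind == ns.kind == "M"  # same kind of dtype...
assert ms != ns                   # ...different parameter
```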
> > If you're not going to support arbitrary encodings right off the bat, > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first > as they seem to knock off more use cases straight away.

Please list the use cases in the context of numpy usage. hdf5 is the most obvious, but how exactly would hdf5 use a utf8 array in the actual implementation? What you save by having utf8 in the numpy array is replacing a decoding and encoding step with a null-padding-stripping step. That doesn't seem very worthwhile compared to all the other overheads involved.

From robert.kern at gmail.com Wed Apr 26 13:45:20 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 10:45:20 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID:

On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald wrote: > > On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: >> >> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: >>> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. >>> >>> The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. >> >> It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: >> http://utf8everywhere.org/#myths >> >> I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications.
So if we're adding any new string encodings, it needs to be one of them. > > It seems to me that most of the requirements people have expressed in this thread would be satisfied by: > > (1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.) > > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case. > > (3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense. The void dtype is already there for this general purpose and mostly works, with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the scalars (fortunately, they do not appear to be mutable views). It also accepts `bytes` strings that are too short (pads with NULs) and too long (truncates). If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Wed Apr 26 14:31:00 2017 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 26 Apr 2017 11:31:00 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <-5378706506035339722@unknownmsgid> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID:

On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" < chris.barker at noaa.gov> wrote:

UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy to access, and particularly defaults, numpy string dtypes should match it.

This seems a little vague? The "character-oriented Python text model" is just that str supports O(1) indexing of characters. But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers. So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.

-n -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Wed Apr 26 15:03:33 2017 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 26 Apr 2017 15:03:33 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID:

On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith wrote: > On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" > wrote: > > > UTF-8 does not match the character-oriented Python text model.
Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the model python 3 has chosen. I wrote a much > longer rant about that earlier. > > So I think the easy to access, and particularly defaults, numpy string > dtypes should match it. > > > This seems a little vague? The "character-oriented Python text model" is > just that str supports O(1) indexing of characters. But... Numpy doesn't. If > you want to access individual characters inside a string inside an array, > you have to pull out the scalar first, at which point the data is copied and > boxed into a Python object anyway, using whatever representation the > interpreter prefers. So AFAICT it makes literally no difference to the user > whether numpy's internal representation allows for fast character access.

you can create a view on individual characters or bytes, AFAICS

>>> t = np.array(['abcdefg']*10)
>>> t2 = t.view([('s%d' % i, '<U1') for i in range(7)])
>>> t2['s5']
array(['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'], dtype='<U1')
>>> t.view('<U1')

> -n > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion >

From sebastian at sipsolutions.net Wed Apr 26 14:38:09 2017 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 26 Apr 2017 20:38:09 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: <1493231889.17161.3.camel@sipsolutions.net>

On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote: > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > wrote: > > > > > Indeed, > > > Most of this discussion is irrelevant to numpy. > > > Numpy only really deals with the in memory storage of strings.
> > > And in > > > that it is limited to fixed length strings (in bytes/codepoints). > > > How you get your messy strings into numpy arrays is not very > > > relevant to > > > the discussion of a smaller representation of strings. > > > You couldn't get messy strings into numpy without first sorting > > > it out > > > yourself before, you won't be able to afterwards. > > > Numpy will offer a set of encodings, the user chooses which one > > > is best > > > for the use case and if the user screws it up, it is not numpy's > > > problem. > > > > > > You currently only have a few ways to even construct string > > > arrays: > > > - array construction and loops > > > - genfromtxt (which is again just a loop) > > > - memory mapping which I seriously doubt anyone actually does for > > > the S > > > and U dtype > > > > I fear that you decided that the discussion was irrelevant and thus > > did > > not read it rather than reading it to decide that it was not > > relevant. > > Because several of us have showed that, yes indeed, we do memory- > > map > > string arrays. > > > > You can add to this list C APIs, like that of libhdf5, that need to > > communicate (Unicode) string arrays. > > > > Look, I know I can be tedious, but *please* go back and read this > > discussion. We have concrete use cases outlined. We can give you > > more > > details if you need them. We all feel the pain of the rushed, > > inadequate > > implementation of the U dtype. But each of our pains is a little > > bit > > different; you obviously aren't experiencing the same pains that I > > am. > > I have read every mail and it has been a large waste of time, > Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it. > In any case it is still irrelevant. 
My proposal only _adds_ > additional > cases that can be mmapped. It does not prevent you from doing what > you > have been doing before. > > > > > > Having a new dtype changes nothing here. You still need to create > > > numpy > > > arrays from python strings which are well defined and clean. > > > If you put something in that doesn't encode you get an encoding > > > error. > > > No oddities like surrogate escapes are needed, numpy arrays are > > > not > > > interfaces to operating systems nor does numpy need to _add_ > > > support for > > > historical oddities beyond what it already has. > > > If you want to represent bytes exactly as they came in don't use > > > a text > > > dtype (which includes the S dtype, use i1). > > > > Thomas Aldcroft has demonstrated the problem with this approach. > > numpy > > arrays are often interfaces to files that have tons of historical > > oddities. > > This does not matter for numpy, the text dtype is well defined as > bytes > with a specific encoding and null padding. If you have an historical > oddity that does not fit, do not use the text dtype but use a pure > byte > array instead. > > > > > > Concerning variable sized strings, this is simply not going to > > > happen. > > > Nobody is going to rewrite numpy to support it, especially not > > > just for > > > something as unimportant as strings. > > > Best you are going to get (or better already have) is object > > > arrays. It > > > makes no sense to discuss it unless someone comes up with an > > > actual > > > proposal and the willingness to code it. > > > > No one has suggested such a thing. At most, we've talked about > > specializing object arrays. > > > > > What is a relevant discussion is whether we really need a more > > > compact > > > but limited representation of text than 4-byte utf32 at all. > > > Its usecase is for the most part just for python3 porting and > > > saving > > > some memory in some ascii heavy cases, e.g. astronomy. 
> > > It is not that significant anymore as porting to python3 has > > > mostly > > > already happened via the ugly byte workaround and memory saving > > > is > > > probably not as significant in the context of numpy which is > > > already > > > heavy on memory usage. > > > > > > My initial approach was to not add a new dtype but to make > > > unicode > > > parametrizable which would have meant almost no cluttering of > > > numpys > > > internals and keeping the api more or less consistent which would > > > make > > > this a relatively simple addition of minor functionality for > > > people that > > > want it. > > > But adding a completely new partially redundant dtype for this > > > usecase > > > may be a too large change to the api. Having two partially > > > redundant > > > string types may confuse users more than our current status quo > > > of our > > > single string type (U). > > > > > > Discussing whether we want to support truncated utf8 has some > > > merit as > > > it is a decision whether to give the users an even larger gun to > > > shot > > > themselves in the foot with. > > > But I'd like to focus first on the 1 byte type to add a symmetric > > > API > > > for python2 and python3. > > > utf8 can always be added latter should we deem it a good idea. > > > > What is your current proposal? A string dtype parameterized with > > the > > encoding (initially supporting the latin-1 that you desire and > > maybe > > adding utf-8 later)? Or a latin-1-specific dtype such that we will > > have > > to add a second utf-8 dtype at a later date? > > My proposal is a single new parameterizable dtype. Adding multiple > dtypes for each encoding seems unnecessary to me given that numpy > already supports parameterizable types. > For example datetime is very similar, it is basically encoded > integers. > There are multiple encodings = units supported. 
> > > > If you're not going to support arbitrary encodings right off the > > bat, > > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape > > first > > as they seem to knock off more use cases straight away. > > > > > Please list the use cases in the context of numpy usage. hdf5 is the > most obvious, but how exactly would hdf5 use a utf8 array in the > actual > implementation? > > What you save by having utf8 in the numpy array is replacing a > decoding > and encoding step with a stripping null padding step. > That doesn't seem very worthwhile compared to all the other > overheads > involved.

I remember talking with a colleague about something like that. And basically an annoying thing there was that if you strip the zero bytes in a zero padded string, some encodings (UTF16) may need one of the zero bytes to work right. (I think she got around it by weird trickery, inverting the endianness or so and thus putting the zero bytes first.) Maybe I will ask her if this discussion is interesting to her. Though I think it might have been something like "make everything in hdf5/something similar work" without any actual use case, I don't know.

I have not read the thread, but I think a fixed byte but settable encoding type would make sense. I personally wonder whether storing the length might make sense, even if that removes direct memory mapping, but as you said, you can still memmap the bytes, and then probably just cast back and forth. Sorry if there is zero actual input here :)

- Sebastian

> _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: This is a digitally signed message part URL: From robert.kern at gmail.com Wed Apr 26 15:07:47 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 12:07:47 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > > > > wrote: > > > >> Indeed, > >> Most of this discussion is irrelevant to numpy. > >> Numpy only really deals with the in memory storage of strings. And in > >> that it is limited to fixed length strings (in bytes/codepoints). > >> How you get your messy strings into numpy arrays is not very relevant to > >> the discussion of a smaller representation of strings. > >> You couldn't get messy strings into numpy without first sorting it out > >> yourself before, you won't be able to afterwards. > >> Numpy will offer a set of encodings, the user chooses which one is best > >> for the use case and if the user screws it up, it is not numpy's problem. > >> > >> You currently only have a few ways to even construct string arrays: > >> - array construction and loops > >> - genfromtxt (which is again just a loop) > >> - memory mapping which I seriously doubt anyone actually does for the S > >> and U dtype > > > > I fear that you decided that the discussion was irrelevant and thus did > > not read it rather than reading it to decide that it was not relevant. > > Because several of us have showed that, yes indeed, we do memory-map > > string arrays. > > > > You can add to this list C APIs, like that of libhdf5, that need to > > communicate (Unicode) string arrays. 
> > Look, I know I can be tedious, but *please* go back and read this > discussion. We have concrete use cases outlined. We can give you more > details if you need them. We all feel the pain of the rushed, inadequate > implementation of the U dtype. But each of our pains is a little bit > different; you obviously aren't experiencing the same pains that I am. > > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, but I can assure you that I have memory mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support".

> In any case it is still irrelevant. My proposal only _adds_ additional > cases that can be mmapped. It does not prevent you from doing what you > have been doing before.

You are the one who keeps worrying about the additional complexity, both in code and mental capacity of our users, of adding new overlapping dtypes and solutions, and you're not wrong about that. I think it behooves us to consider if there are solutions that solve multiple related problems at once instead of adding new dtypes piecemeal to solve individual problems.

> >> Having a new dtype changes nothing here.
You still need to create numpy > >> arrays from python strings which are well defined and clean. > >> If you put something in that doesn't encode you get an encoding error. > >> No oddities like surrogate escapes are needed, numpy arrays are not > >> interfaces to operating systems nor does numpy need to _add_ support for > >> historical oddities beyond what it already has. > >> If you want to represent bytes exactly as they came in don't use a text > >> dtype (which includes the S dtype, use i1). > > > > Thomas Aldcroft has demonstrated the problem with this approach. numpy > > arrays are often interfaces to files that have tons of historical oddities. > > This does not matter for numpy, the text dtype is well defined as bytes > with a specific encoding and null padding.

You cannot dismiss something as "not mattering for *numpy*" just because your new, *proposed* text dtype doesn't support it. You seem to have fixed on a course of action and are defining everyone else's use cases as out-of-scope because your course of action doesn't support them. That's backwards. Define the use cases first, determine the requirements, then build a solution that meets those requirements. We skipped those steps before, and that's why we're all feeling the pain.

> If you have an historical > oddity that does not fit, do not use the text dtype but use a pure byte > array instead.

That's his status quo, and he finds it unworkable. Now, I have proposed a way out of that by supporting ASCII-surrogateescape as a specific encoding. It's not an ISO standard encoding, but the surrogateescape mechanism seems to be what the Python world has settled on for such situations. Would you support that with your parameterized-encoding text dtype?

> >> Concerning variable sized strings, this is simply not going to happen. > >> Nobody is going to rewrite numpy to support it, especially not just for > >> something as unimportant as strings.
> >> Best you are going to get (or better already have) is object arrays. It > >> makes no sense to discuss it unless someone comes up with an actual > >> proposal and the willingness to code it. > > > > No one has suggested such a thing. At most, we've talked about > > specializing object arrays. > > > >> What is a relevant discussion is whether we really need a more compact > >> but limited representation of text than 4-byte utf32 at all. > >> Its usecase is for the most part just for python3 porting and saving > >> some memory in some ascii heavy cases, e.g. astronomy. > >> It is not that significant anymore as porting to python3 has mostly > >> already happened via the ugly byte workaround and memory saving is > >> probably not as significant in the context of numpy which is already > >> heavy on memory usage. > >> > >> My initial approach was to not add a new dtype but to make unicode > >> parametrizable which would have meant almost no cluttering of numpys > >> internals and keeping the api more or less consistent which would make > >> this a relatively simple addition of minor functionality for people that > >> want it. > >> But adding a completely new partially redundant dtype for this usecase > >> may be a too large change to the api. Having two partially redundant > >> string types may confuse users more than our current status quo of our > >> single string type (U). > >> > >> Discussing whether we want to support truncated utf8 has some merit as > >> it is a decision whether to give the users an even larger gun to shoot > >> themselves in the foot with. > >> But I'd like to focus first on the 1 byte type to add a symmetric API > >> for python2 and python3. > >> utf8 can always be added later should we deem it a good idea. > > > > What is your current proposal? A string dtype parameterized with the > > encoding (initially supporting the latin-1 that you desire and maybe > > adding utf-8 later)?
Or a latin-1-specific dtype such that we will have > > to add a second utf-8 dtype at a later date? > > My proposal is a single new parameterizable dtype. Adding multiple > dtypes for each encoding seems unnecessary to me given that numpy > already supports parameterizable types. > For example datetime is very similar, it is basically encoded integers. > There are multiple encodings = units supported. Okay great. What encodings are you intending to support? You seem to be pushing against supporting UTF-8. > > If you're not going to support arbitrary encodings right off the bat, > > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first > > as they seem to knock off more use cases straight away. > > Please list the use cases in the context of numpy usage. hdf5 is the > most obvious, but how exactly would hdf5 use an utf8 array in the actual > implementation? File reading: The user requests data from a fixed-width UTF-8 Dataset. E.g. h5py: >>> a = h5['/some_utf8_array'][:] h5py looks at the Dataset's shape (with the fixed width defined in bytes) and allocates a numpy UTF-8 array with the dtype being given the same bytewidth as specified by the Dataset. h5py fills in the data quickly in bulk using libhdf5's efficient APIs for such data movement. The user now has a numpy array whose scalars come out/go in as `unicode/str` objects. File writing: The user needs to create a string Dataset with Unicode characters. A fixed-width UTF-8 Dataset is preferred (in this case) over HDF5 variable-width Datasets because the latter is not compressible, and the strings are all reasonably close in size. The user's in-memory data may or may not be in a UTF-8 array (it might be in an object array of `unicode/str` string objects or a U-dtype array), but h5py can use numpy's conversion machinery to turn it into a numpy UTF-8 array (much like it can accept lists of floats and cast it to a float64 array). 
It can look at the UTF-8 array's shape and itemsize to create the corresponding Dataset, and then pass the array to libhdf5's efficient APIs for copying arrays of data into a Dataset. > What you save by having utf8 in the numpy array is replacing a decoding > and encoding step with a stripping null padding step. > That doesn't seem very worthwhile compared to all their other overheads > involved. It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Apr 26 15:17:15 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 12:17:15 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <1493231889.17161.3.camel@sipsolutions.net> References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. (I think she got around it, by weird > trickery, inverting the endianess or so and thus putting the zero bytes > first). > Maybe will ask her if this discussion is interesting to her.
Though I > think it might have been something like "make everything in > hdf5/something similar work" without any actual use case, I don't know. I don't think that will be an issue for an encoding-parameterized dtype. The decoding machinery of that would have access to the full-width buffer for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes for UTF-16). It's only if you have to hack around at a higher level with numpy's S arrays, which return Python byte strings that strip off the trailing NULL bytes, that you have to worry about such things. Getting a Python scalar from the numpy S array loses information in such cases. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Apr 26 18:27:10 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 15:27:10 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith wrote: > UTF-8 does not match the character-oriented Python text model. Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the model python 3 has chosen. I wrote a much > longer rant about that earlier. > > So I think the easy to access, and particularly defaults, numpy string > dtypes should match it. > > > This seems a little vague? > sorry -- that's what I get for trying to be concise... > The "character-oriented Python text model" is just that str supports O(1) > indexing of characters.
> not really -- I think the performance characteristics are an implementation detail (though it did influence the design, I'm sure) I'm referring to the fact that a python string appears (to the user -- also under the hood, but again, implementation detail) to be a sequence of characters, not a sequence of bytes, not a sequence of glyphs, or graphemes, or anything else. Every Python string has a length, and that length is the number of characters, and if you index you get a string of length-1, and it has one character in it, and that character corresponds to a single code point. Someone could implement a python string using utf-8 under the hood, and none of that would change (and I think micropython may have done that...) Sure, you might get two characters when you really expect a single grapheme, but it's at least a consistent oddity. (well, not always, as some graphemes can be represented by either a single code point or two combined -- human language really sucks!) The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point that a character-oriented interface is not the only one that makes sense, and may not make sense at all. However: 1) Python has chosen that interface 2) It is a good interface (probably the best for computer use) if you need to choose only one utf8everywhere is mostly arguing for utf-8 over utf16 -- and secondarily for utf-8 everywhere as the best option for working at the C level. That's probably true. (I also think the utf-8 fans are in a bit of a fantasy world -- this would all be easier, yes, if one encoding was used for everything, all the time, but other than that, utf-8 is not a panacea -- we are still going to have encoding headaches no matter how you slice it) So where does numpy fit? Well, it does operate at the C level, but people work with it from python, so exposing the details of the encoding to the user should be strictly opt-in.
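To make that contrast concrete, here is a quick sketch of the character-oriented model versus byte counts (plain Python, no numpy needed):

```python
# Python's str is a sequence of code points: len() and indexing are
# character-oriented, no matter how many bytes each character needs
# in any particular encoding.
s = "naïve café"  # 10 characters; two of them need 2 bytes in UTF-8

print(len(s))                      # 10 -- characters, not bytes
print(s[2])                        # 'ï' -- indexing yields a length-1 str
print(len(s.encode("utf-8")))      # 12 -- byte length depends on the encoding
print(len(s.encode("utf-32-le")))  # 40 -- 4 bytes/char, like numpy's U dtype
```

The same string has one character count but a different byte count per encoding, which is exactly the mismatch being discussed.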
When a numpy user wants to put a string into a numpy array, they should know how long a string they can fit -- with "length" defined the way python strings define it. Using utf-8 for the default string in numpy would be like using float16 for default float -- not a good idea! I believe Julian said there would be no default -- you would need to specify, but I think there does need to be one: np.array(["a string", "another string"]) needs to do something. If we make a parameterized dtype that accepts any encoding, then we could do: np.array(["a string", "another string"], dtype=np.stringtype["utf-8"]) If folks really want that. I'm afraid that that would lead to errors -- cool, utf-8 is just like ascii, but with full Unicode support! But... Numpy doesn't. If you want to access individual characters inside a > string inside an array, you have to pull out the scalar first, at which > point the data is copied and boxed into a Python object anyway, using > whatever representation the interpreter prefers. > > So AFAICT? it makes literally no difference to the user whether numpy's > internal representation allows for fast character access. > agreed - unless someone wants to do a view that makes an N-D array for strings look like a 1-D array of characters.... Which seems odd, but there was recently a big debate on the netcdf CF conventions list about that very issue... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed...
URL: From chris.barker at noaa.gov Wed Apr 26 18:44:03 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 15:44:03 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <1493231889.17161.3.camel@sipsolutions.net> References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. I think it's really clear that you don't want to mess with the bytes in any way without knowing the encoding -- for UTF-16, the code unit is two bytes, so a "null" is two zero bytes in a row. So generic "null padded" or "null terminated" is dangerous -- it would have to be "Null-padded utf-8" or whatever. Though I > think it might have been something like "make everything in > hdf5/something similar work" That would be nice :-), but I suspect HDF-5 is the same as everything else -- there are files in the wild where someone jammed the wrong thing into a text array .... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Wed Apr 26 19:21:26 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 16:21:26 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern wrote: > >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many cases. > isn't UTF-32 pretty compressible also? lots of zeros in there.... here's an example with pure ascii Lorem Ipsum text: In [17]: len(text) Out[17]: 446 In [18]: len(utf8) Out[18]: 446 # the same -- it's pure ascii In [20]: len(utf32) Out[20]: 1788 # four times as big -- of course. In [22]: len(bz2.compress(utf8)) Out[22]: 302 # so from 446 to 302, not that great -- probably it would be better for longer text # -- but are we compressing whole arrays or individual strings? In [23]: len(bz2.compress(utf32)) Out[23]: 319 # almost as good as the compressed utf-8 And I'm guessing it would be even closer with more non-ascii characters. OK -- turns out I'm wrong -- here it is with greek -- not a lot of ascii characters: In [29]: len(text) Out[29]: 672 In [30]: utf8 = text.encode("utf-8") In [31]: len(utf8) Out[31]: 1180 # not bad, really -- still smaller than utf-16 :-) In [33]: len(bz2.compress(utf8)) Out[33]: 495 # pretty good then -- better than 50% In [34]: utf32 = text.encode("utf-32") In [35]: len(utf32) Out[35]: 2692 In [36]: len(bz2.compress(utf32)) Out[36]: 515 # still not quite as good as utf-8, but close. So: utf-8 compresses better than utf-32, but only by a little bit -- at least with bz2. But it is a lot smaller uncompressed. >>> The major use case that we have for a UTF-8 array is HDF5, and it > specifies the width in bytes, not Unicode characters.
> >> > >> It's not just HDF5. Counting bytes is the Right Way to measure the size > of UTF-8 encoded text: > >> http://utf8everywhere.org/#myths > It's really the only way with utf-8 -- which is why it is an impedance mismatch with python strings. >> I also firmly believe (though clearly this is not universally agreed > upon) that UTF-8 is the Right Way to encode strings for *non-legacy* > applications. > fortunately, we don't need to agree to that to agree that: > So if we're adding any new string encodings, it needs to be one of them. > Yup -- the most important one to add -- I don't think it is "The Right Way" for all applications -- but it is "The Right Way" for text interchange. And regardless of what any of us think -- it is widely used. > (1) object arrays of strings. (We have these already; whether a > strings-only specialization would permit useful things like string-oriented > ufuncs is a question for someone who's willing to implement one.) > This is the right way to get variable length strings -- but I'm concerned that it doesn't mesh well with numpy uses like npz files, raw dumping of array data, etc. It should not be the only way to get proper Unicode support, nor the default when you do: array(["this", "that"]) > > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. > All python encodings should be permitted. An additional function to > truncate encoded data without mangling the encoding would be handy. > I think necessary -- at least when you pass in a python string... > I think it makes more sense for this to be NULL-padded than > NULL-terminated but it may be necessary to support both; note that > NULL-termination is complicated for encodings like UCS4. > is it if you know it's UCS4? or even know the size of the code-unit (I think that's the term) > This also includes the legacy UCS4 strings as a special case. > what's special about them? I think the only thing should be that they are the default.
> > > (3) a dtype for fixed-length byte strings. This doesn't look very > different from an array of dtype u8, but given we have the bytes type, > accessing the data this way makes sense. > > The void dtype is already there for this general purpose and mostly works, > with a few niggles. > I'd never noticed that! And if I had I never would have guessed I could use it that way. > If it worked more transparently and perhaps rigorously with `bytes`, then > it would be quite suitable. > Then we should fix a bit of those things -- and call it something like "bytes", please. -CHB > > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Apr 26 19:30:04 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 26 Apr 2017 16:30:04 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker wrote: > When a numpy user wants to put a string into a numpy array, they should > know how long a string they can fit -- with "length" defined how python > strings define it. > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and myself have already given), but we seem to be talking past each other here. I am still -1 on any new string encoding support unless that includes at least UTF-8, with length indicated by the number of bytes. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From njs at pobox.com Wed Apr 26 19:49:29 2017 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 26 Apr 2017 16:49:29 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: [...] > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it. Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683 I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, I can assure you that I have memory mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support". Since concrete examples are often helpful in focusing discussions, here's some code for reading a lab-internal EEG file format: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py See in particular _header_dtype with its embedded string fields, and the code in _channel_names_from_header -- both of these really benefit from having a quick and easy way to talk about fixed width strings of single byte characters.
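For readers without that file at hand, a toy header in the same spirit may help -- a structured dtype with fixed-width byte-string fields embedded in a binary record. The field names and sizes here are invented for illustration, not the actual erpss layout:

```python
import numpy as np

# A structured dtype mixing fixed-width byte-string fields with numeric
# fields, read straight out of a binary blob -- the pattern _header_dtype
# uses (field names/sizes here are made up, not the real format).
header_dtype = np.dtype([
    ("magic",     "S4"),    # 4-byte format tag
    ("subject",   "S8"),    # NUL-padded subject id
    ("nchannels", "<u2"),   # little-endian channel count
])

raw = b"EEG0" + b"s042" + b"\x00" * 4 + (32).to_bytes(2, "little")
hdr = np.frombuffer(raw, dtype=header_dtype)[0]

print(hdr["magic"])            # b'EEG0'
print(hdr["subject"])          # b's042' -- trailing NULs stripped on access
print(int(hdr["nchannels"]))   # 32
```

This is essentially the "read sizeof(struct header) and cast" idiom described next, done from Python; note how the S scalar silently drops the NUL padding on access, which is the information-loss issue raised earlier in the thread.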
(The history here of course is that the original tools for reading/writing this format are written in C, and they just read in sizeof(struct header) and cast to the header.) _get_full_string in that file is also interesting: it's a nasty hack I implemented because in some cases I actually needed *fixed width* strings, not NUL padded ones, and didn't know a better way to do it. (Yes, there's void, but I have no idea how those work. They're somehow related to buffer objects, whatever those are?) In other cases though that file really does want NUL padding. Of course that file is python 2 and blissfully ignorant of unicode. Thinking about what we'd want if porting to py3: For the "pull out this fixed width chunk of the file" problem (what _get_full_string does) then I definitely don't care about unicode; this isn't text. np.void or an array of np.uint8 aren't actually too terrible I suspect, but it'd be nice if there were a fixed-width dtype where indexing gave back a native bytes or bytearray object, or something similar like np.bytes_. For the arrays of single-byte-encoded-NUL-padded text, then the fundamental problem is just to convert between a chunk of bytes in that format and something that numpy can handle. One way to do that would be with a dtype that represented ascii-encoded-fixed-width-NUL-padded text, or any ascii-compatible encoding. But honestly I'd be just as happy with np.encode/np.decode ufuncs that converted between the existing S dtype and any kind of text array; the existing U dtype would be fine given that. The other thing that might be annoying in practice is that when writing py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on py3", but there's no dtype with similar behavior. Maybe there's no good solution and this just needs a few version-dependent convenience functions stuck in a private utility library, dunno.
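Worth noting: numpy already ships elementwise codec conversions that come close to the np.encode/np.decode ufuncs suggested above -- np.char.decode and np.char.encode -- though they build a new array rather than reinterpreting in place:

```python
import numpy as np

# np.char.decode / np.char.encode convert elementwise between S (bytes)
# and U (text) arrays with any Python codec -- close in spirit to the
# np.encode/np.decode ufuncs suggested above.
s = np.array([b"spam", b"egg\xe9"], dtype="S4")  # latin-1 bytes on disk

u = np.char.decode(s, "latin-1")     # S -> U, one decode per element
print(u)        # ['spam' 'eggé']

back = np.char.encode(u, "latin-1")  # U -> S round trip
print(back[1])  # b'egg\xe9'
```

These allocate a full UCS-4 result, so they don't address the memory argument, but they do cover the "get NUL-padded single-byte text into something numpy can handle" conversion step.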
> What you save by having utf8 in the numpy array is replacing a decoding > and encoding step with a stripping null padding step. > That doesn't seem very worthwhile compared to all their other overheads > involved. It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it? My impression is similar to Julian's: you *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code, which is nothing compared to all the other hoops these libraries are already jumping through, so if this is really the roadblock then I must be missing something. -n -------------- next part -------------- An HTML attachment was scrubbed...
URL: From chris.barker at noaa.gov Wed Apr 26 20:02:12 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 17:02:12 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer wrote: > > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and > myself have already given), but we seem to be talking past each other here. > yeah -- I think it's not clear what the use cases we are talking about are. > I am still -1 on any new string encoding support unless that includes at > least UTF-8, with length indicated by the number of bytes. > I've said multiple times that utf-8 support is key to any "exchange binary data" use case (memory mapping?) -- so yes, absolutely. I _think_ this may be some of the source for the confusion: The name of this thread is: "proposal: smaller representation of string arrays". And I got the impression, maybe mistaken, that folks were suggesting that internally encoding strings in numpy as "UTF-8, with length indicated by the number of bytes." was THE solution to the "the 'U' dtype takes up way too much memory, particularly for mostly-ascii data" problem. I do not think it is a good solution to that problem. I think a good solution to that problem is latin-1 encoding. (bear with me here...) But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. For THAT -- utf-8 is critical. But if I understand Julian's proposal -- he wants to create a parameterized text dtype that you can set the encoding on, and then numpy will use the encoding (and python's machinery) to encode / decode when passing to/from python strings.
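What that encode / decode step would amount to can be sketched with plain Python codecs (pure illustration -- no such numpy dtype exists; the helper names are invented):

```python
# Sketch of what a parameterized text dtype's scalar machinery might do:
# encode into a fixed-width NUL-padded field, and strip the padding only
# *after* decoding, so multi-byte code units (UTF-16/32) survive intact.
# Helper names are invented; assumes the field width is a multiple of the
# encoding's code-unit size.
def pack(text, nbytes, encoding):
    raw = text.encode(encoding)
    if len(raw) > nbytes:
        raise ValueError("encoded text needs %d bytes, field has %d"
                         % (len(raw), nbytes))
    return raw + b"\x00" * (nbytes - len(raw))

def unpack(raw, encoding):
    # Decode first, then strip trailing NUL *characters*; stripping raw
    # zero bytes would corrupt UTF-16/UTF-32 data.
    return raw.decode(encoding).rstrip("\x00")

field = pack("café", 8, "utf-8")
print(field)                   # b'caf\xc3\xa9\x00\x00\x00'
print(unpack(field, "utf-8"))  # café
print(unpack(pack("abc", 12, "utf-16-le"), "utf-16-le"))  # abc
```

The same two helpers work for any codec, which is the appeal of parameterizing the dtype on the encoding rather than hard-wiring one.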
It seems this would support all our desires: I'd get a latin-1 encoded type for compact representation of mostly-ascii data. Thomas would get latin-1 for binary interchange with mostly-ascii data. The HDF-5 folks would get utf-8 for binary interchange (if we can work out the null-padding issue). Even folks that had weird JAVA or Windows-generated UTF-16 data files could do the binary interchange thing.... I'm now lost as to what the hang-up is. -CHB PS: null padding is a pain, python strings seem to preserve the zeros, which is odd -- is there a unicode code-point at \x00? But you can use it to strip properly with the unicode sandwich: In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00' In [64]: ut16.decode('utf-16') Out[64]: 'some text\x00\x00\x00' In [65]: ut16.decode('utf-16').strip('\x00') Out[65]: 'some text' In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16') Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00' -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Apr 26 20:08:30 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 17:08:30 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > > On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: >> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years.
The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. > > I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it? https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/379 -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Apr 26 20:17:29 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 17:17:29 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker wrote: > But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: > > * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. > > For THAT -- utf-8 is critical.
> > But if I understand Julian's proposal -- he wants to create a parameterized text dtype that you can set the encoding on, and then numpy will use the encoding (and python's machinery) to encode / decode when passing to/from python strings. > > It seems this would support all our desires: > > I'd get a latin-1 encoded type for compact representation of mostly-ascii data. > > Thomas would get latin-1 for binary interchange with mostly-ascii data > > The HDF-5 folks would get utf-8 for binary interchange (If we can workout the null-padding issue) > > Even folks that had weird JAVA or Windows-generated UTF-16 data files could do the binary interchange thing.... > > I'm now lost as to what the hang-up is. The proposal is for only latin-1 and UTF-32 to be supported at first, and the eventual support of UTF-8 will be constrained by specification of the width in terms of characters rather than bytes, which conflicts with the use cases of UTF-8 that have been brought forth. https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Apr 26 20:50:22 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 17:50:22 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern wrote: > The proposal is for only latin-1 and UTF-32 to be supported at first, and > the eventual support of UTF-8 will be constrained by specification of the > width in terms of characters rather than bytes, which conflicts with the > use cases of UTF-8 that have been brought forth. > > https://mail.python.org/pipermail/numpy-discussion/ > 2017-April/076668.html > thanks -- I had forgotten (clearly) it was that limited. 
But my question now is -- if there is an encoding-parameterized string dtype, then is it much more effort to have it support all the encodings in the stdlib? It seems that would solve everyone's issue. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Apr 26 21:34:41 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 26 Apr 2017 18:34:41 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > It's worthwhile enough that both major HDF5 bindings don't support Unicode > arrays, despite user requests for years. The sticking point seems to be the > difference between HDF5's view of a Unicode string array (defined in size > by the bytes of UTF-8 data) and numpy's current view of a Unicode string > array (because of UCS-4, defined by the number of > characters/codepoints/whatever). So there are HDF5 files out there that > none of our HDF5 bindings can read, and it is impossible to write certain > data efficiently. > > > I would really like to hear more from the authors of these libraries about > what exactly it is they feel they're missing. Is it that they want numpy to > enforce the length limit early, to catch errors when the array is modified > instead of when they go to write it to the file? Is it that they really > want an O(1) way to look at an array and know the maximum number of > bytes needed to represent it in utf-8?
Is it that utf8<->utf-32 conversion is > really annoying and files that need it are rare so they haven't had the > motivation to implement it? My impression is similar to Julian's: you > *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few > dozen lines of code, which is nothing compared to all the other hoops these > libraries are already jumping through, so if this is really the roadblock > then I must be missing something. > I actually agree with you. I think it's mostly a matter of convenience that h5py matched up HDF5 dtypes with numpy dtypes:

fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and there's not an easy fix. We absolutely could fix h5py by mapping everything to object arrays of Python unicode strings, as has been discussed (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would be a fine but non-ideal solution, since there is currently no fixed width UTF-8 support. For fixed width ASCII arrays, this would mean increased convenience for Python 3 users, at the price of decreased convenience for Python 2 users (arrays now contain boxed Python objects), unless we made the h5py behavior dependent on the version of Python. Hence, we're back here, waiting for better dtypes for encoded strings. So for HDF5, I see good use cases for ASCII-with-surrogateescape (for handling ASCII arrays as strings) and UTF-8 with length equal to the number of bytes. -------------- next part -------------- An HTML attachment was scrubbed...
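The Python 3 pain point behind that h5py dtype mapping is easy to reproduce; a minimal sketch (the array values are invented for the example):

```python
import numpy as np

# Fixed-width HDF5 ASCII data maps to numpy's 'S' dtype, whose
# scalars are bytes on Python 3, not str.
a = np.array([b"alpha", b"beta"], dtype="S5")

assert a.dtype.itemsize == 5            # one byte per character
assert a[0] == b"alpha"                 # comparing against bytes works
assert a[0] != "alpha"                  # ...but str never equals bytes on Python 3
assert a[0].decode("ascii") == "alpha"  # the boundary decode h5py would have to do
```

This is the "decreased convenience" trade-off: either users decode by hand, or the library boxes everything into object arrays of str.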
URL: From faltet at gmail.com Thu Apr 27 07:10:42 2017 From: faltet at gmail.com (Francesc Alted) Date: Thu, 27 Apr 2017 13:10:42 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : > On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > >> It's worthwhile enough that both major HDF5 bindings don't support >> Unicode arrays, despite user requests for years. The sticking point seems >> to be the difference between HDF5's view of a Unicode string array (defined >> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode >> string array (because of UCS-4, defined by the number of >> characters/codepoints/whatever). So there are HDF5 files out there that >> none of our HDF5 bindings can read, and it is impossible to write certain >> data efficiently. >> >> >> I would really like to hear more from the authors of these libraries >> about what exactly it is they feel they're missing. Is it that they want >> numpy to enforce the length limit early, to catch errors when the array is >> modified instead of when they go to write it to the file? Is it that they >> really want an O(1) way to look at a array and know the maximum number of >> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion >> is really annoying and files that need it are rare so they haven't had the >> motivation to implement it? My impression is similar to Julian's: you >> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few >> dozen lines of code, which is nothing compared to all the other hoops these >> libraries are already jumping through, so if this is really the roadblock >> then I must be missing something. >> > > I actually agree with you. 
I think it's mostly a matter of convenience > that h5py matched up HDF5 dtypes with numpy dtypes: > fixed width ASCII -> np.string_/bytes > variable length ASCII -> object arrays of np.string_/bytes > variable length UTF-8 -> object arrays of unicode > > This was tenable in a Python 2 world, but on Python 3 it's broken and > there's not an easy fix. > > We absolutely could fix h5py by mapping everything to object arrays of > Python unicode strings, as has been discussed ( > https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would > be a fine but non-ideal solution, since there is currently no fixed width > UTF-8 support. > > For fixed width ASCII arrays, this would mean increased convenience for > Python 3 users, at the price of decreased convenience for Python 2 users > (arrays now contain boxed Python objects), unless we made the h5py behavior > dependent on the version of Python. Hence, we're back here, waiting for > better dtypes for encoded strings. > > So for HDF5, I see good use cases for ASCII-with-surrogateescape (for > handling ASCII arrays as strings) and UTF-8 with length equal to the number > of bytes. > Well, I'll say upfront that I have not read this discussion in full, but apparently some opinions from developers of HDF5 Python packages would be welcome here, so here I go :) As a long-time developer of one of the Python HDF5 packages (PyTables), I have always been of the opinion that plain ASCII (for byte strings) and UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing large amounts of data, especially for disk storage (but also using compressed in-memory containers). My rationale is that, although UCS-4 may require way too much space, compression would reduce that to basically the space that is required by compressed UTF-8 (I won't go into detail, but basically this is possible by using the shuffle filter).
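The shuffle-then-compress idea can be sketched with plain numpy and zlib (a toy stand-in for HDF5's shuffle + deflate filters, not Blosc itself; the word list is invented for the example):

```python
import zlib
import numpy as np

words = np.array(["numpy", "scipy", "pandas"] * 1000, dtype="U6")  # UCS-4: 24 bytes/element
raw = words.tobytes()

# Shuffle filter: regroup the 1st, 2nd, 3rd and 4th bytes of every code
# point so the (almost always zero) high bytes form long runs that
# deflate can squeeze out.
shuffled = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 4).T.copy().tobytes()

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))

# The shuffle is lossless: transposing back recovers the original bytes.
restored = np.frombuffer(shuffled, dtype=np.uint8).reshape(4, -1).T.copy().tobytes()
assert restored == raw
assert shuffled_size < len(raw)  # far below the raw 24 bytes/element
```

For mostly-ASCII data, three of every four bytes per code point are zero, which is why the compressed UCS-4 stream ends up in the same ballpark as compressed UTF-8.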
I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is any going back (not even adding UCS-4 support on it, although I continue to think it would be a good idea). So, I suppose that if HDF5 is found to be an important format for NumPy users (and I think this is the case), a solution for representing Unicode characters by using UTF-8 in NumPy would be desirable (at the risk of making the implementation more complex). Francesc > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Apr 27 07:27:31 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 27 Apr 2017 11:27:31 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: So while compression+ucs-4 might be OK for out-of-core representation, what about in-core? blosc+ucs-4? I don't think that works for mmap, does it? On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote: > 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : > >> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: >> >>> It's worthwhile enough that both major HDF5 bindings don't support >>> Unicode arrays, despite user requests for years. The sticking point seems >>> to be the difference between HDF5's view of a Unicode string array (defined >>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode >>> string array (because of UCS-4, defined by the number of >>> characters/codepoints/whatever).
So there are HDF5 files out there that >>> none of our HDF5 bindings can read, and it is impossible to write certain >>> data efficiently. >>> >>> >>> I would really like to hear more from the authors of these libraries >>> about what exactly it is they feel they're missing. Is it that they want >>> numpy to enforce the length limit early, to catch errors when the array is >>> modified instead of when they go to write it to the file? Is it that they >>> really want an O(1) way to look at a array and know the maximum number of >>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion >>> is really annoying and files that need it are rare so they haven't had the >>> motivation to implement it? My impression is similar to Julian's: you >>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few >>> dozen lines of code, which is nothing compared to all the other hoops these >>> libraries are already jumping through, so if this is really the roadblock >>> then I must be missing something. >>> >> >> I actually agree with you. I think it's mostly a matter of convenience >> that h5py matched up HDF5 dtypes with numpy dtypes: >> fixed width ASCII -> np.string_/bytes >> variable length ASCII -> object arrays of np.string_/bytes >> variable length UTF-8 -> object arrays of unicode >> >> This was tenable in a Python 2 world, but on Python 3 it's broken and >> there's not an easy fix. >> >> We absolutely could fix h5py by mapping everything to object arrays of >> Python unicode strings, as has been discussed ( >> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this >> would be a fine but non-ideal solution, since there is currently no fixed >> width UTF-8 support. 
>> >> For fixed width ASCII arrays, this would mean increased convenience for >> Python 3 users, at the price of decreased convenience for Python 2 users >> (arrays now contain boxed Python objects), unless we made the h5py behavior >> dependent on the version of Python. Hence, we're back here, waiting for >> better dtypes for encoded strings. >> >> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for >> handling ASCII arrays as strings) and UTF-8 with length equal to the number >> of bytes. >> > > Well, I'll say upfront that I have not read this discussion in the fully, > but apparently some opinions from developers of HDF5 Python packages would > be welcome here, so here I go :) ? > > As a long-time developer of one of the Python HDF5 packages (PyTables), I > have always been of the opinion that plain ASCII (for byte strings) and > UCS-4 (for Unicode) encoding would be the appropriate dtypes? for storing > large amounts of data, most specially for disk storage (but also using > compressed in-memory containers). My rational is that, although UCS-4 may > require way too much space, compression would reduce that to basically the > space that is required by compressed UTF-8 (I won't go into detail, but > basically this is possible by using the shuffle filter). > > I remember advocating for UCS-4 adoption in the HDF5 library many years > ago (2007?), but I had no success and UTF-8 was decided to be the best > candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I > don't think there is a go back (not even adding UCS-4 support on it, > although I continue to think it would be a good idea). So, I suppose that > if HDF5 is found to be an important format for NumPy users (and I think > this is the case), a solution for representing Unicode characters by using > UTF-8 in NumPy would be desirable (at the risk of making the implementation > more complex). > > ?Francesc > ? 
> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > > > -- > Francesc Alted > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faltet at gmail.com Thu Apr 27 07:38:06 2017 From: faltet at gmail.com (Francesc Alted) Date: Thu, 27 Apr 2017 13:38:06 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: 2017-04-27 13:27 GMT+02:00 Neal Becker : > So while compression+ucs-4 might be OK for out-of-core representation, > what about in-core? blosc+ucs-4? I don't think that works for mmap, does > it? > ?Correct, the real problem is mmap for an out-of-core, HDF5 representation, I presume. For in-memory, there are several compressed data containers, like: https://github.com/alimanfoo/zarr (meant mainly for multidimensional data containers) https://github.com/Blosc/bcolz? (meant mainly for tabular data containers) ?(there might be others).? > > On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote: > >> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : >> >>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: >>> >>>> It's worthwhile enough that both major HDF5 bindings don't support >>>> Unicode arrays, despite user requests for years. The sticking point seems >>>> to be the difference between HDF5's view of a Unicode string array (defined >>>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode >>>> string array (because of UCS-4, defined by the number of >>>> characters/codepoints/whatever). 
So there are HDF5 files out there >>>> that none of our HDF5 bindings can read, and it is impossible to write >>>> certain data efficiently. >>>> >>>> >>>> I would really like to hear more from the authors of these libraries >>>> about what exactly it is they feel they're missing. Is it that they want >>>> numpy to enforce the length limit early, to catch errors when the array is >>>> modified instead of when they go to write it to the file? Is it that they >>>> really want an O(1) way to look at a array and know the maximum number of >>>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion >>>> is really annoying and files that need it are rare so they haven't had the >>>> motivation to implement it? My impression is similar to Julian's: you >>>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few >>>> dozen lines of code, which is nothing compared to all the other hoops these >>>> libraries are already jumping through, so if this is really the roadblock >>>> then I must be missing something. >>>> >>> >>> I actually agree with you. I think it's mostly a matter of convenience >>> that h5py matched up HDF5 dtypes with numpy dtypes: >>> fixed width ASCII -> np.string_/bytes >>> variable length ASCII -> object arrays of np.string_/bytes >>> variable length UTF-8 -> object arrays of unicode >>> >>> This was tenable in a Python 2 world, but on Python 3 it's broken and >>> there's not an easy fix. >>> >>> We absolutely could fix h5py by mapping everything to object arrays of >>> Python unicode strings, as has been discussed ( >>> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this >>> would be a fine but non-ideal solution, since there is currently no fixed >>> width UTF-8 support. 
>>> >>> For fixed width ASCII arrays, this would mean increased convenience for >>> Python 3 users, at the price of decreased convenience for Python 2 users >>> (arrays now contain boxed Python objects), unless we made the h5py behavior >>> dependent on the version of Python. Hence, we're back here, waiting for >>> better dtypes for encoded strings. >>> >>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for >>> handling ASCII arrays as strings) and UTF-8 with length equal to the number >>> of bytes. >>> >> >> Well, I'll say upfront that I have not read this discussion in the fully, >> but apparently some opinions from developers of HDF5 Python packages would >> be welcome here, so here I go :) ? >> >> As a long-time developer of one of the Python HDF5 packages (PyTables), I >> have always been of the opinion that plain ASCII (for byte strings) and >> UCS-4 (for Unicode) encoding would be the appropriate dtypes? for storing >> large amounts of data, most specially for disk storage (but also using >> compressed in-memory containers). My rational is that, although UCS-4 may >> require way too much space, compression would reduce that to basically the >> space that is required by compressed UTF-8 (I won't go into detail, but >> basically this is possible by using the shuffle filter). >> >> I remember advocating for UCS-4 adoption in the HDF5 library many years >> ago (2007?), but I had no success and UTF-8 was decided to be the best >> candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I >> don't think there is a go back (not even adding UCS-4 support on it, >> although I continue to think it would be a good idea). So, I suppose that >> if HDF5 is found to be an important format for NumPy users (and I think >> this is the case), a solution for representing Unicode characters by using >> UTF-8 in NumPy would be desirable (at the risk of making the implementation >> more complex). >> >> ?Francesc >> ? 
>> >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >>> >> >> >> -- >> Francesc Alted >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 27 12:18:47 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 27 Apr 2017 09:18:47 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote: > I remember advocating for UCS-4 adoption in the HDF5 library many years > ago (2007?), but I had no success and UTF-8 was decided to be the best > candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I > don't think there is a go back > This is the key point -- we can argue all we want about the best encoding for fixed-length unicode-supporting strings (I think numpy and HDF have very similar requirements), but that is not our decision to make -- many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly and easily and consistently. I have made many anti utf-8 points in this thread because while we need to deal with utf-8 for interplay with other systems, I am very sure that it is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. 
Nor is it the best encoding for a mostly-ascii compact in memory format. So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support one-byte encodings suitable for non-european languages, and maybe utf-16 for Java and Windows compatibility, and .... So that seems to point to "support as many encodings as possible". And python has the machinery to do so -- so why not? (I'm taking Julian's word for it that having a parameterized dtype would not have a major impact on current code.) If we go with a string dtype parameterized by encoding, then we can pick sensible defaults, and let users use what they know best fits their use-cases. As for python2 -- it is on the way out; I think we should keep the 'U' and 'S' dtypes as they are for backward compatibility and move forward with the new one(s) in a way that is optimized for py3. And it would map to a py2 Unicode type. The only catch I see in that is what to do with bytes -- we should have a numpy dtype that matches the bytes model -- fixed length bytes that map to python bytes objects (this is almost what the void type is, yes?). But then under py2, would a bytes object (py2 string) map to numpy 'S' or numpy bytes objects? @Francesc: one more question for you: How important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e. would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed...
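The "python has the machinery to do so" point can be made concrete with the stdlib codecs registry: one code path handles any named encoding. The helpers below are a hypothetical sketch of what a parameterized dtype might do at its assignment/getitem boundary, not NumPy API; note that NUL-padding as done here is only safe for encodings that never emit embedded zero bytes, so the utf-16/utf-32 padding question stands:

```python
import codecs

def pack(s, encoding, nbytes):
    """What a parameterized dtype might do on assignment: encode and
    NUL-pad to the fixed field width, refusing strings that don't fit."""
    raw = codecs.encode(s, encoding)
    if len(raw) > nbytes:
        raise ValueError("needs %d bytes, field holds %d" % (len(raw), nbytes))
    return raw.ljust(nbytes, b"\x00")

def unpack(raw, encoding):
    """What it might do on item access: strip the padding and decode."""
    return codecs.decode(raw.rstrip(b"\x00"), encoding)

# The same two functions cover every codec the stdlib knows about:
for enc in ("ascii", "latin-1", "utf-8", "shift_jis"):
    assert unpack(pack("numpy", enc, 16), enc) == "numpy"
```

Supporting "as many encodings as possible" is then mostly a matter of storing the codec name on the dtype instance and validating widths at assignment time.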
URL: From faltet at gmail.com Thu Apr 27 12:57:03 2017 From: faltet at gmail.com (Francesc Alted) Date: Thu, 27 Apr 2017 18:57:03 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: 2017-04-27 18:18 GMT+02:00 Chris Barker : > On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote: > >> I remember advocating for UCS-4 adoption in the HDF5 library many years >> ago (2007?), but I had no success and UTF-8 was decided to be the best >> candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I >> don't think there is a go back >> > > This is the key point -- we can argue all we want about the best encoding > for fixed-length unicode-supporting strings (I think numpy and HDF have > very similar requirements), but that is not our decision to make -- many > other systems have chosen utf-8, so it's a really good idea for numpy to be > able to deal with that cleanly and easily and consistently. > Agreed. But it would also be a good idea to spread the word that simple UCS4 encoding in combination with compression can be a perfectly good system for storing large amounts of unicode data too. > > I have made many anti utf-8 points in this thread because while we need to > deal with utf-8 for interplay with other systems, I am very sure that it is > not the best format for a default, naive-user-of-numpy unicode-supporting > dtype. Nor is it the best encoding for a mostly-ascii compact in memory > format. > I resonate a lot with this feeling too :) > > So I think numpy needs to support at least: > > utf-8 > latin-1 > UCS-4 > > And it maybe should support one-byte encoding suitable for non-european > languages, and maybe utf-16 for Java and Windows compatibility, and ....
> > So that seems to point to "support as many encodings as possible" And > python has the machinery to do so -- so why not? > > (I'm taking Julian's word for it that having a parameterized dtype would > not have a major impact on current code) > > If we go with a parameterized by encoding string dtype, then we can pick > sensible defaults, and let users use what they know best fits their > use-cases. > > As for python2 -- it is on the way out, I think we should keep the 'U' and > 'S' dtypes as they are for backward compatibility and move forward with the > new one(s) in a way that is optimized for py3. And it would map to a py2 > Unicode type. > > The only catch I see in that is what to do with bytes -- we should have a > numpy dtype that matches the bytes model -- fixed length bytes that map to > python bytes objects. (this is almost what the void type is yes?) but then > under py2, would a bytes object (py2 string) map to numpy 'S' or numpy > bytes objects?? > > @Francesc: -- one more question for you: > > How important is it for pytables to match the numpy storage to the hdf > storage byte for byte? i.e. would it be a killer if encoding / decoding > happened every time at the boundary? I'm guessing yes, as this would have > been solved long ago if not. > The PyTables team decided some time ago that it was a waste of time and resources to maintain its internal HDF5 interface, and that it would be better to switch to h5py for the low-level I/O communication with HDF5 (btw, we just received a small NumFOCUS grant to continue the ongoing work on this; thanks guys!). This means that PyTables will be basically agnostic about this sort of encoding issue, and that the important package to take into account for interfacing NumPy and HDF5 is just h5py. -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed...
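The boundary conversion Chris asks about already exists today as an explicit, copying step via np.char; a quick illustration of the latin-1 space saving that motivates the whole discussion (the array contents are invented for the example):

```python
import numpy as np

u = np.array(["mostly ascii text", "fits in latin-1"])  # dtype '<U17': 4 bytes/char (UCS-4)
s = np.char.encode(u, "latin-1")                        # dtype 'S17': 1 byte/char

assert u.dtype.itemsize == 68   # 17 characters x 4 bytes
assert s.dtype.itemsize == 17   # one byte per character
assert np.char.decode(s, "latin-1")[0] == u[0]          # lossless round trip
```

A dtype that stored the latin-1 bytes directly would get the 4x size win without the per-access encode/decode copies shown here.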
URL: From opossumnano at gmail.com Fri Apr 28 05:29:38 2017 From: opossumnano at gmail.com (Tiziano Zito) Date: Fri, 28 Apr 2017 02:29:38 -0700 (PDT) Subject: [Numpy-discussion] [ANN] 10th Advanced Scientific Programming in Python in Nikiti, Greece, August 28–September 2, 2017 Message-ID: <59030b82.ea85df0a.38a89.24cc@mx.google.com>

10th Advanced Scientific Programming in Python
==============================================

a Summer School by the G-Node and the Municipality of Sithonia

Scientists spend more and more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only a few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices which are standard in the industry, but especially tailored to the needs of a programming scientist.

Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project: an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to become a standard tool for the programming scientist.

This school is targeted at Master or PhD students and Post-docs from all areas of science.
Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course. We are striving hard to get a pool of students which is international and gender-balanced.

You can apply online: https://python.g-node.org

Application deadline: 23:59 UTC, May 31, 2017. There will be no deadline extension, so be sure to apply on time ;-) Be sure to read the FAQ before applying.

Participation is for free, i.e. no fee is charged! Participants however should take care of travel, living, and accommodation expenses by themselves.

Date & Location
===============
August 28–September 2, 2017. Nikiti, Sithonia, Halkidiki, Greece

Program
=======
• Best Programming Practices
  – Best practices for scientific programming
  – Version control with git and how to contribute to open source projects with GitHub
  – Best practices in data visualization
• Software Carpentry
  – Test-driven development
  – Debugging with a debugger
  – Profiling code
• Scientific Tools for Python
  – Advanced NumPy
• Advanced Python
  – Decorators
  – Context managers
  – Generators
• The Quest for Speed
  – Writing parallel applications
  – Interfacing to C with Cython
  – Memory-bound problems and memory profiling
  – Data containers: storage and fast access to large data
• Practical Software Development
  – Group project

Preliminary Faculty
===================
• Francesc Alted, freelance consultant, author of Blosc, Castelló de la Plana, Spain
• Pietro Berkes, NAGRA Kudelski, Lausanne, Switzerland
• Zbigniew Jędrzejewski-Szmek, Krasnow Institute, George Mason University, Fairfax, VA USA
• Eilif Muller, Blue Brain Project, École Polytechnique Fédérale de Lausanne, Switzerland
• Juan Nunez-Iglesias, Victorian Life Sciences Computation Initiative, University of Melbourne, Australia
• Rike-Benjamin Schuppner, Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Germany
• Nicolas P. Rougier, Inria Bordeaux Sud-Ouest, Institute of Neurodegenerative Disease, University of Bordeaux, France
• Bartosz Teleńczuk, European Institute for Theoretical Neuroscience, CNRS, Paris, France
• Stéfan van der Walt, Berkeley Institute for Data Science, UC Berkeley, CA USA
• Nelle Varoquaux, Berkeley Institute for Data Science, UC Berkeley, CA USA
• Tiziano Zito, freelance consultant, Berlin, Germany

Organizers
==========
For the German Neuroinformatics Node of the INCF (G-Node) Germany:
• Tiziano Zito, freelance consultant, Berlin, Germany
• Zbigniew Jędrzejewski-Szmek, Krasnow Institute, George Mason University, Fairfax, USA
• Jakob Jordan, Institute of Neuroscience and Medicine (INM-6), Forschungszentrum Jülich GmbH, Germany
• Etienne Roesch, Centre for Integrative Neuroscience and Neurodynamics, University of Reading, UK

Website: https://python.g-node.org
Contact: python-info at g-node.org

From harrigan.matthew at gmail.com Fri Apr 28 12:53:54 2017 From: harrigan.matthew at gmail.com (Matthew Harrigan) Date: Fri, 28 Apr 2017 12:53:54 -0400 Subject: [Numpy-discussion] [NumPy-discussion] Wish List of Possible ufunc Enhancements Message-ID: Here is a link to a wish list of possible ufunc enhancements. I would like to know what the community thinks. Thank you, Matt Harrigan -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Apr 29 01:29:45 2017 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 28 Apr 2017 22:29:45 -0700 Subject: [Numpy-discussion] [NumPy-discussion] Wish List of Possible ufunc Enhancements In-Reply-To: References: Message-ID: On Fri, Apr 28, 2017 at 9:53 AM, Matthew Harrigan wrote: > Here is a link to a wish list of possible ufunc enhancements.
I would like > to know what the community thinks. It looks like a pretty good list of ideas worth thinking about as and when someone has time :-). I'm not sure what feedback you're looking for beyond that? Do you have a purpose in mind for this list? The main thing I'd add is: making it possible for ufunc core loops to access the dtype object. This is the main blocker on a *lot* of things, probably more so than anything else on that list, because it would allow ufunc operations to be defined for parametrized dtypes like the S and U dtypes, categorical data, etc. -n -- Nathaniel J. Smith -- https://vorpus.org
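The parametrized-dtype point is visible from Python already: for the S and U dtypes the length is carried on the dtype *instance*, which is exactly the information a ufunc inner loop, receiving only bare data pointers and strides, does not get to see. A small sketch of the observation (not of any proposed API):

```python
import numpy as np

# The length parameter lives on the dtype instance, not on the type:
assert np.dtype("U8").itemsize == 32       # 8 code points x 4 bytes (UCS-4)
assert np.dtype("U16").itemsize == 64
assert np.dtype("U8") != np.dtype("U16")   # distinct dtypes of the same kind
assert np.dtype("S8").itemsize == 8        # 'S' is parameterized the same way
```

Any future encoding-parameterized string dtype, categorical dtype, etc. would carry its parameters the same way, hence the need for loops that can inspect the dtype object.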