From jtaylor.debian at googlemail.com Mon Apr 3 07:28:22 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Mon, 3 Apr 2017 13:28:22 +0200 Subject: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated) In-Reply-To: <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> References: <1489688042-5554705.54580375.fv2GIDbTc031721@rs159.luxsci.com> <78cad834-ff24-3a21-ed14-912309d8089d@googlemail.com> <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> Message-ID: <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> On 31.03.2017 16:07, Julian Taylor wrote: > On 31.03.2017 15:51, Nathaniel Smith wrote: >> On Mar 31, 2017 1:15 AM, "Ralf Gommers" > > wrote: >> >> >> >> On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers >> > wrote: >> >> >> >> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor >> > > wrote: >> >> I have two ideas under one big important topic: make numpy >> python3 >> compatible. >> >> The first fits pretty well with the grant size and nobody >> wants to do it >> for free: >> - fix our text IO functions under python3 and support multiple >> encodings, not only latin1. >> Reasonably simple to do, slap encoding arguments on the >> functions, >> generate test cases and somehow keep backward compatibility. >> Some >> prelimary unfinished work is in >> https://github.com/numpy/numpy/pull/4208 >> >> >> >> I like that idea, it's a recurring pain point. Are you >> interested to work on it, or are you thinking to advertise the >> idea here to see if anyone steps up? >> >> >> More thoughts on this anyone? Or preferences for this idea or the >> numpy.org one? Submission deadline is April 3rd >> and we can only put in one proposal this time, so we need to (a) >> make a choice between these ideas, and (b) write up a proposal. >> >> If there's not enough replies to this so the choice is clear cut, I >> will send out a poll to the core devs. 
>> >> >> Do we have anyone interested in doing the work in either case? That >> seems like the most important consideration to me... >> >> -n >> > > I could do the textio thing if no one shows up for numpy.org. I can > probably check again what is required in the next few days and write a > proposal. > The change will need reviewing in the end too, should that be > compensated too? It feels weird if not. > I have decided not to do it, as it is more or less just a bugfix and I currently do not feel capable of doing it under the added completion pressure. But I have collected some related issues and discussions: https://github.com/numpy/numpy/issues/4600 https://github.com/numpy/numpy/issues/3184 http://numpy-discussion.10968.n7.nabble.com/using-loadtxt-to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 # loadtxt https://github.com/numpy/numpy/pull/4208 # genfromtxt http://numpy-discussion.10968.n7.nabble.com/genfromtxt-universal-newline-support-td37816.html https://github.com/dhomeier/numpy/commit/995ec93 From renato.fabbri at gmail.com Mon Apr 3 08:14:56 2017 From: renato.fabbri at gmail.com (Renato Fabbri) Date: Mon, 3 Apr 2017 09:14:56 -0300 Subject: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated) In-Reply-To: <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> References: <1489688042-5554705.54580375.fv2GIDbTc031721@rs159.luxsci.com> <78cad834-ff24-3a21-ed14-912309d8089d@googlemail.com> <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> Message-ID: maybe OT, but for some years now I have had the recurring idea of making a very simple module for obtaining arrays related to musical elements. All here: https://github.com/ttm/dissertacao scripts/ has Python/NumPy implementations of the musical elements. dissertacaoCorrigida.pdf holds a thorough description of the framework.
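For a taste of the kind of primitive such a module could provide, a minimal sketch (the `note` helper and its defaults are hypothetical, for illustration only, not code from the dissertation):

```python
import numpy as np

def note(freq=440.0, duration=1.0, fs=44100):
    """Hypothetical sketch: one sine-wave note as a NumPy array of samples."""
    t = np.arange(int(fs * duration)) / fs  # sample instants in seconds
    return np.sin(2 * np.pi * freq * t)

a4 = note()  # concert A: one second of a 440 Hz sinusoid at 44.1 kHz
```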
I idealize it as a module inside Numpy but I understand it might be reasonable to do it as a Scipy kit. I handed my doctorate a few days ago and might be willing to put some time into this. PS. long time no post. Hello! On Mon, Apr 3, 2017 at 8:28 AM, Julian Taylor wrote: > On 31.03.2017 16:07, Julian Taylor wrote: > > On 31.03.2017 15:51, Nathaniel Smith wrote: > >> On Mar 31, 2017 1:15 AM, "Ralf Gommers" >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers > >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor > >> >> > wrote: > >> > >> I have two ideas under one big important topic: make numpy > >> python3 > >> compatible. > >> > >> The first fits pretty well with the grant size and nobody > >> wants to do it > >> for free: > >> - fix our text IO functions under python3 and support > multiple > >> encodings, not only latin1. > >> Reasonably simple to do, slap encoding arguments on the > >> functions, > >> generate test cases and somehow keep backward compatibility. > >> Some > >> prelimary unfinished work is in > >> https://github.com/numpy/numpy/pull/4208 > >> > >> > >> > >> I like that idea, it's a recurring pain point. Are you > >> interested to work on it, or are you thinking to advertise the > >> idea here to see if anyone steps up? > >> > >> > >> More thoughts on this anyone? Or preferences for this idea or the > >> numpy.org one? Submission deadline is April 3rd > >> and we can only put in one proposal this time, so we need to (a) > >> make a choice between these ideas, and (b) write up a proposal. > >> > >> If there's not enough replies to this so the choice is clear cut, I > >> will send out a poll to the core devs. > >> > >> > >> Do we have anyone interested in doing the work in either case? That > >> seems like the most important consideration to me... > >> > >> -n > >> > > > > I could do the textio thing if no one shows up for numpy.org. 
I can > > probably check again what is required in the next few days and write a > > proposal. > > The change will need reviewing in the end too, should that be > > compensated too? It feels weird if not. > > > > I have decided to not do it, as it is more or less just a bugfix and I > currently do not feel capable of doing with added completion pressure. > But I have collected some of related issues and discussions: > > https://github.com/numpy/numpy/issues/4600 > https://github.com/numpy/numpy/issues/3184 > http://numpy-discussion.10968.n7.nabble.com/using-loadtxt- > to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 > # loadtxt > https://github.com/numpy/numpy/pull/4208 > # genfromtxt > http://numpy-discussion.10968.n7.nabble.com/genfromtxt- > universal-newline-support-td37816.html > https://github.com/dhomeier/numpy/commit/995ec93 > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- Renato Fabbri GNU/Linux User #479299 labmacambira.sourceforge.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 09:20:56 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 15:20:56 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Hello, Le 30/03/2017 ? 13:31, Pierre Haessig a ?crit : > [....] > > But how come Julia is 4-5x faster since Numpy uses C implementation > for the entire process ? (Mersenne Twister -> uniform double -> > Box-Muller transform to get a Gaussian > https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/randomkit.c). 
> Also I noticed that Julia uses a different algorithm (Ziggurat Method > from Marsaglia and Tsang , > https://github.com/JuliaLang/julia/blob/master/base/random.jl#L700) > but this doesn't explain the difference for uniform rng. > Any ideas? Do you think Stackoverflow would be a better place for my question? best, Pierre -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaime.frio at gmail.com Mon Apr 3 09:44:36 2017 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Mon, 3 Apr 2017 15:44:36 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: On Mon, Apr 3, 2017 at 3:20 PM, Pierre Haessig wrote: > Hello, > Le 30/03/2017 ? 13:31, Pierre Haessig a ?crit : > > [....] > > But how come Julia is 4-5x faster since Numpy uses C implementation for > the entire process ? (Mersenne Twister -> uniform double -> Box-Muller > transform to get a Gaussian https://github.com/numpy/ > numpy/blob/master/numpy/random/mtrand/randomkit.c). Also I noticed that > Julia uses a different algorithm (Ziggurat Method from Marsaglia and Tsang > , https://github.com/JuliaLang/julia/blob/master/base/random.jl#L700) but > this doesn't explain the difference for uniform rng. > > Any ideas? > This says that Julia uses this library , which is different from the home brewed version of the Mersenne twister in NumPy. The second link I posted claims their speed comes from generating double precision numbers directly, rather than generating random bytes that have to be converted to doubles, as is the case of NumPy through this magical incantation . They also throw the SIMD acronym around, which likely means their random number generation is parallelized. 
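As a rough way to frame the comparison, the uniform stage and the full Gaussian pipeline can be timed separately on the NumPy side; a minimal sketch (absolute numbers are machine-dependent and only illustrate the methodology):

```python
import timeit
import numpy as np

n = 1_000_000
# Mersenne Twister -> uniform doubles
t_uniform = timeit.timeit(lambda: np.random.random_sample(n), number=10)
# Mersenne Twister -> uniform doubles -> Gaussian transform
t_normal = timeit.timeit(lambda: np.random.standard_normal(n), number=10)
print("uniform: %.3fs  normal: %.3fs" % (t_uniform, t_normal))
```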
My guess is that most of the speed-up comes from the SIMD parallelization: the Mersenne algorithm does a lot of work to produce 32 random bits, so that likely dominates over a couple of arithmetic operations, even if divisions are involved. Jaime Do you think Stackoverflow would be a better place for my question? > > best, > > Pierre > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Mon Apr 3 09:52:06 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Mon, 03 Apr 2017 13:52:06 +0000 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Take a look here: https://bashtage.github.io/ng-numpy-randomstate/doc/index.html On Mon, Apr 3, 2017 at 9:45 AM Jaime Fern?ndez del R?o wrote: > On Mon, Apr 3, 2017 at 3:20 PM, Pierre Haessig > wrote: > > Hello, > Le 30/03/2017 ? 13:31, Pierre Haessig a ?crit : > > [....] > > But how come Julia is 4-5x faster since Numpy uses C implementation for > the entire process ? (Mersenne Twister -> uniform double -> Box-Muller > transform to get a Gaussian > https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/randomkit.c). > Also I noticed that Julia uses a different algorithm (Ziggurat Method > from Marsaglia and Tsang , > https://github.com/JuliaLang/julia/blob/master/base/random.jl#L700) but > this doesn't explain the difference for uniform rng. > > Any ideas? > > > This > says > that Julia uses this library > , which is > different from the home brewed version of the Mersenne twister in NumPy. 
> The second link I posted claims their speed comes from generating double > precision numbers directly, rather than generating random bytes that have > to be converted to doubles, as is the case of NumPy through this magical > incantation > . > They also throw the SIMD acronym around, which likely means their random > number generation is parallelized. > > My guess is that most of the speed-up comes from the SIMD parallelization: > the Mersenne algorithm does a lot of work > to > produce 32 random bits, so that likely dominates over a couple of > arithmetic operations, even if divisions are involved. > > Jaime > > Do you think Stackoverflow would be a better place for my question? > > best, > > Pierre > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes > de dominaci?n mundial. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 11:46:58 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 17:46:58 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Le 03/04/2017 ? 15:52, Neal Becker a ?crit : > Take a look here: > https://bashtage.github.io/ng-numpy-randomstate/doc/index.html Thanks for the pointer. A very feature-full random generator package. So it is indeed possible to have in Python/Numpy both the "advanced" Mersenne Twister (dSFMT) at the lower level and the Ziggurat algorithm for Gaussian transform on top. Perfect! 
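For reference, this line of work is what eventually landed in NumPy itself as `numpy.random.Generator` (NumPy >= 1.17), whose `standard_normal` uses a Ziggurat implementation; a minimal usage sketch:

```python
import numpy as np

rng = np.random.default_rng(12345)  # PCG64 bit generator by default
u = rng.random(5)                   # uniform doubles in [0, 1)
z = rng.standard_normal(5)          # Gaussian samples via the Ziggurat method
```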
In an ideal world, this would be implemented by default in Numpy, but I understand that this would break the reproducibility of existing codes. best, Pierre From ndbecker2 at gmail.com Mon Apr 3 11:49:19 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Mon, 03 Apr 2017 15:49:19 +0000 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: I think the intention is that this is the next gen of numpy randomstate, and will eventually be merged in. On Mon, Apr 3, 2017 at 11:47 AM Pierre Haessig wrote: > > Le 03/04/2017 ? 15:52, Neal Becker a ?crit : > > Take a look here: > > https://bashtage.github.io/ng-numpy-randomstate/doc/index.html > Thanks for the pointer. A very feature-full random generator package. > > So it is indeed possible to have in Python/Numpy both the "advanced" > Mersenne Twister (dSFMT) at the lower level and the Ziggurat algorithm > for Gaussian transform on top. Perfect! > > In an ideal world, this would be implemented by default in Numpy, but I > understand that this would break the reproducibility of existing codes. > > best, > Pierre > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 11:59:22 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 17:59:22 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Le 03/04/2017 ? 15:44, Jaime Fern?ndez del R?o a ?crit : > This > says > that Julia uses this library > , which > is different from the home brewed version of the Mersenne twister in > NumPy. 
The second link I posted claims their speed comes from > generating double precision numbers directly, rather than generating > random bytes that have to be converted to doubles, as is the case of > NumPy through this magical incantation > . > They also throw the SIMD acronym around, which likely means their > random number generation is parallelized. > > My guess is that most of the speed-up comes from the SIMD > parallelization: the Mersenne algorithm does a lot of work > to > produce 32 random bits, so that likely dominates over a couple of > arithmetic operations, even if divisions are involved. Thanks for the feedback. I'm not good in enough in reading Julia to be 100% sure, but I feel like that the random.jl (https://github.com/JuliaLang/julia/blob/master/base/random.jl) contains a Julia implementation of Mersenne Twister... but I have no idea whether it is the "fancy" SIMD version or the "old" 32bits version. best, Pierre -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Apr 3 12:33:13 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 3 Apr 2017 09:33:13 -0700 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: On Apr 3, 2017 8:59 AM, "Pierre Haessig" wrote: Le 03/04/2017 ? 15:44, Jaime Fern?ndez del R?o a ?crit : This says that Julia uses this library , which is different from the home brewed version of the Mersenne twister in NumPy. The second link I posted claims their speed comes from generating double precision numbers directly, rather than generating random bytes that have to be converted to doubles, as is the case of NumPy through this magical incantation . They also throw the SIMD acronym around, which likely means their random number generation is parallelized. 
My guess is that most of the speed-up comes from the SIMD parallelization: the Mersenne algorithm does a lot of work to produce 32 random bits, so that likely dominates over a couple of arithmetic operations, even if divisions are involved. Thanks for the feedback. I'm not good in enough in reading Julia to be 100% sure, but I feel like that the random.jl (https://github.com/JuliaLang/ julia/blob/master/base/random.jl) contains a Julia implementation of Mersenne Twister... but I have no idea whether it is the "fancy" SIMD version or the "old" 32bits version. That code contains many references to "dSFMT", which is the name of the "fancy" algorithm. IIUC dSFMT is related to the mersenne twister but is actually a different generator altogether -- advertising that Julia uses the mersenne twister is somewhat misleading IMHO. Of course this is really the fault of the algorithm's designers for creating multiple algorithms that have "mersenne twister" as part of their names... -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.haessig at crans.org Mon Apr 3 12:33:16 2017 From: pierre.haessig at crans.org (Pierre Haessig) Date: Mon, 3 Apr 2017 18:33:16 +0200 Subject: [Numpy-discussion] speed of random number generator compared to Julia In-Reply-To: References: <0db9ba2d-cb97-821e-62e5-b1b922c785a8@crans.org> Message-ID: Le 03/04/2017 ? 17:49, Neal Becker a ?crit : > I think the intention is that this is the next gen of numpy > randomstate, and will eventually be merged in. Ah yes, I found the related issue in the meantime: https://github.com/numpy/numpy/issues/6967 Thanks again for the pointers. 
Pierre From ralf.gommers at gmail.com Mon Apr 3 16:22:29 2017 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Tue, 4 Apr 2017 08:22:29 +1200 Subject: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated) In-Reply-To: <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> References: <1489688042-5554705.54580375.fv2GIDbTc031721@rs159.luxsci.com> <78cad834-ff24-3a21-ed14-912309d8089d@googlemail.com> <9079116f-b13c-a695-e1b8-e9777467c1d9@googlemail.com> <65432297-9ead-15b8-26bc-3424fd30e96b@googlemail.com> Message-ID: On Mon, Apr 3, 2017 at 11:28 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > On 31.03.2017 16:07, Julian Taylor wrote: > > On 31.03.2017 15:51, Nathaniel Smith wrote: > >> On Mar 31, 2017 1:15 AM, "Ralf Gommers" >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers > >> > wrote: > >> > >> > >> > >> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor > >> >> > wrote: > >> > >> I have two ideas under one big important topic: make numpy > >> python3 > >> compatible. > >> > >> The first fits pretty well with the grant size and nobody > >> wants to do it > >> for free: > >> - fix our text IO functions under python3 and support > multiple > >> encodings, not only latin1. > >> Reasonably simple to do, slap encoding arguments on the > >> functions, > >> generate test cases and somehow keep backward compatibility. > >> Some > >> prelimary unfinished work is in > >> https://github.com/numpy/numpy/pull/4208 > >> > >> > >> > >> I like that idea, it's a recurring pain point. Are you > >> interested to work on it, or are you thinking to advertise the > >> idea here to see if anyone steps up? > >> > >> > >> More thoughts on this anyone? Or preferences for this idea or the > >> numpy.org one? Submission deadline is April 3rd > >> and we can only put in one proposal this time, so we need to (a) > >> make a choice between these ideas, and (b) write up a proposal. 
> >> > >> If there's not enough replies to this so the choice is clear cut, I > >> will send out a poll to the core devs. > >> > >> > >> Do we have anyone interested in doing the work in either case? That > >> seems like the most important consideration to me... > Fair enough. Had a plan, but my weekend went a bit different than planned so couldn't follow up on it. > >> > >> -n > >> > > > > I could do the textio thing if no one shows up for numpy.org. I can > > probably check again what is required in the next few days and write a > > proposal. > > The change will need reviewing in the end too, should that be > > compensated too? It feels weird if not. > > > > I have decided to not do it, as it is more or less just a bugfix and I > currently do not feel capable of doing with added completion pressure. > Good call Julian. I struggled with the same thing - had a designer to do the numpy.org work, but that still needed someone to do the content, review, etc. Decided not to try to take that on, because I'm already struggling to keep up. > But I have collected some of related issues and discussions: > Thanks, I'm sure that'll be of use at some point. Ralf > > https://github.com/numpy/numpy/issues/4600 > https://github.com/numpy/numpy/issues/3184 > http://numpy-discussion.10968.n7.nabble.com/using-loadtxt- > to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 > # loadtxt > https://github.com/numpy/numpy/pull/4208 > # genfromtxt > http://numpy-discussion.10968.n7.nabble.com/genfromtxt- > universal-newline-support-td37816.html > https://github.com/dhomeier/numpy/commit/995ec93 > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mads.ipsen at gmail.com Tue Apr 4 03:14:07 2017 From: mads.ipsen at gmail.com (Mads Ipsen) Date: Tue, 4 Apr 2017 09:14:07 +0200 Subject: [Numpy-discussion] bitwise or'ing rows Message-ID: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> Hi If I have an n x m array of bools, is there a handy way for me to perform a 'bitwise_and' or 'bitwise_or' along an axis, for example all the rows or all the columns? For example a = [[1,0,0,0], [0,0,1,0], [0,0,0,0]] (0 and 1 meaning True and False) a.bitwise_or(axis=0) giving [1,0,1,0] Best regards, Mads -- +---------------------------------------------------------------------+ | Mads Ipsen | +----------------------------------+----------------------------------+ | Overgaden Oven Vandet 106, 4.tv | phone: +45-29716388 | | DK-1415 K?benhavn K | email: mads.ipsen at gmail.com | | Denmark | map : https://goo.gl/maps/oQ6y6 | +----------------------------------+----------------------------------+ From jaime.frio at gmail.com Tue Apr 4 03:49:37 2017 From: jaime.frio at gmail.com (=?UTF-8?Q?Jaime_Fern=C3=A1ndez_del_R=C3=ADo?=) Date: Tue, 4 Apr 2017 09:49:37 +0200 Subject: [Numpy-discussion] bitwise or'ing rows In-Reply-To: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> References: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> Message-ID: On Tue, Apr 4, 2017 at 9:14 AM, Mads Ipsen wrote: > Hi > > If I have an n x m array of bools, is there a handy way for me to perform > a 'bitwise_and' or 'bitwise_or' along an axis, for example all the rows or > all the columns? For example > > a = > [[1,0,0,0], > [0,0,1,0], > [0,0,0,0]] (0 and 1 meaning True and False) > > a.bitwise_or(axis=0) > > giving > > [1,0,1,0] > I think what you want is equivalent to np.all(a, axis=0) for bitwise_and and np.any(a, axis=0) for bitwise_or. You can also use the more verbose np.bitwise_and.reduce(a, axis=0) and np.bitwise_or.reduce(a, axis=0). 
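In code, assuming the boolean array from the question:

```python
import numpy as np

a = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0]], dtype=bool)

or_cols = np.any(a, axis=0)                 # OR down the columns  -> [True, False, True, False]
and_cols = np.all(a, axis=0)                # AND down the columns -> [False, False, False, False]
or_cols2 = np.bitwise_or.reduce(a, axis=0)  # same result as np.any(a, axis=0)
```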
Jaime > > Best regards, > > Mads > > > > > -- > +---------------------------------------------------------------------+ > | Mads Ipsen | > +----------------------------------+----------------------------------+ > | Overgaden Oven Vandet 106, 4.tv | phone: +45-29716388 | > | DK-1415 K?benhavn K | email: mads.ipsen at gmail.com | > | Denmark | map : https://goo.gl/maps/oQ6y6 | > +----------------------------------+----------------------------------+ > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus planes de dominaci?n mundial. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mads.ipsen at gmail.com Tue Apr 4 03:52:13 2017 From: mads.ipsen at gmail.com (Mads Ipsen) Date: Tue, 4 Apr 2017 09:52:13 +0200 Subject: [Numpy-discussion] bitwise or'ing rows In-Reply-To: References: <46b4c4ab-b510-8d2f-b8bb-4b8c3925dab4@gmail.com> Message-ID: Thanks! On 04/04/2017 09:49 AM, Jaime Fern?ndez del R?o wrote: > On Tue, Apr 4, 2017 at 9:14 AM, Mads Ipsen > wrote: > > Hi > > If I have an n x m array of bools, is there a handy way for me to > perform a 'bitwise_and' or 'bitwise_or' along an axis, for example > all the rows or all the columns? For example > > a = > [[1,0,0,0], > [0,0,1,0], > [0,0,0,0]] (0 and 1 meaning True and False) > > a.bitwise_or(axis=0) > > giving > > [1,0,1,0] > > > I think what you want is equivalent to np.all(a, axis=0) for bitwise_and > and np.any(a, axis=0) for bitwise_or. > > You can also use the more verbose np.bitwise_and.reduce(a, axis=0) and > np.bitwise_or.reduce(a, axis=0). 
> > Jaime > > > > Best regards, > > Mads > > > > > -- > +---------------------------------------------------------------------+ > | Mads Ipsen | > +----------------------------------+----------------------------------+ > | Overgaden Oven Vandet 106, 4.tv | phone: > +45-29716388 | > | DK-1415 K?benhavn K | email: > mads.ipsen at gmail.com | > | Denmark | map : https://goo.gl/maps/oQ6y6 | > +----------------------------------+----------------------------------+ > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ay?dale en sus > planes de dominaci?n mundial. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -- +---------------------------------------------------------------------+ | Mads Ipsen | +----------------------------------+----------------------------------+ | Overgaden Oven Vandet 106, 4.tv | phone: +45-29716388 | | DK-1415 K?benhavn K | email: mads.ipsen at gmail.com | | Denmark | map : https://goo.gl/maps/oQ6y6 | +----------------------------------+----------------------------------+ From charlesr.harris at gmail.com Sat Apr 8 15:45:30 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 8 Apr 2017 13:45:30 -0600 Subject: [Numpy-discussion] __array_ufunc__ Message-ID: Hi All, After a week of review and rework, the new and improved __array_ufunc__ has turned the corner and is headed down the homestretch. Now is the time for interested parties to give it a final lookover at https://github.com/numpy/numpy/pull/8247. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtaylor.debian at googlemail.com Sun Apr 9 07:27:17 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sun, 9 Apr 2017 13:27:17 +0200 Subject: [Numpy-discussion] call for testing: unicode loadtxt/genfromtxt Message-ID: <027728fd-3619-6e67-da0a-3e591020dcd8@googlemail.com> hi, It has been very very long overdue, but we finally have an attempt at making our text IO functions actually use text IO instead of bytes IO. This means genfromtxt, loadtxt, fromregex and savetxt should support unicode input files of any Python-supported encoding, as well as universal newlines. This is the first stepping stone to finally making numpy python3 compatible. The code is available in: https://github.com/numpy/numpy/pull/4208 Great effort has been spent to keep it backward compatible, but we only have our testsuite as a reference, which for sure does not cover all of the workarounds employed for this issue in the last 8 years. So we need people to dig out their ugliest hacks and test whether they still work with this changeset. Functions that need testing are: loadtxt genfromtxt fromregex savetxt Test on any input that worked in older versions of numpy (including gzip compressed) and inputs that did not work because they were encoded in something other than latin1 or had issues with linebreaks. The PR adds an encoding keyword argument to all functions dealing with text input and output. All streams opened by the functions have been changed from byte streams to text streams. As previously only latin1 encoded byte streams were supported, all input bytestreams are still decoded as such. Converters added by the user may have been relying on the input to them being bytes. To deal with that, the default encoding argument is 'bytes', which corresponds to the default encoding (None) and enables conversion to latin1 encoded bytes before passing to user converters. If you want to use converters based on strings, you now have to explicitly set encoding to something else (e.g. None).
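A minimal sketch of the intended usage (the file name and contents are made up for the demo; the `encoding` keyword is the one added by the PR, which later shipped in NumPy 1.14):

```python
import os
import tempfile
import numpy as np

# a UTF-8 input file that a latin1-only reader would mangle (note the µ)
content = u"# spannung in \u00b5V\n1.0 2.0\n3.0 4.0\n"
path = os.path.join(tempfile.mkdtemp(), "measurements.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(content)

# decode the file as UTF-8 instead of assuming latin1 bytes
arr = np.loadtxt(path, encoding="utf-8")  # -> float array of shape (2, 2)
```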
Currently the functions do not support the newlines keyword argument that Python's text IO streams support. This will probably still get added. Related issues and discussions: https://github.com/numpy/numpy/issues/4600 https://github.com/numpy/numpy/issues/3184 https://github.com/numpy/numpy/issues/4939 https://github.com/numpy/numpy/issues/4543 http://numpy-discussion.10968.n7.nabble.com/using-loadtxt-to-load-a-text-file-in-to-a-numpy-array-tt35992.html#a36003 http://numpy-discussion.10968.n7.nabble.com/genfromtxt-universal-newline-support-td37816.html https://github.com/dhomeier/numpy/commit/995ec93 cheers, Julian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From divenex at gmail.com Thu Apr 13 07:02:33 2017 From: divenex at gmail.com (Dive Nex) Date: Thu, 13 Apr 2017 12:02:33 +0100 Subject: [Numpy-discussion] Fixing inconsistent behaviour of reduceat() Message-ID: Hi all, I would like to try to reach a consensus about a long-standing inconsistent behavior of reduceat() reported and discussed here: https://github.com/numpy/numpy/issues/834 In summary, it seems an elegant and logical design choice, and one that all users will expect, for out = ufunc.reduceat(a, indices) to produce, for all indices j (except for the last one), the following: out[j] = ufunc.reduce(a[indices[j]:indices[j+1]]) However, the current documented and actual behavior is, for the case indices[j] >= indices[j+1], to return simply out[j] = a[indices[j]] I cannot see any application where this behavior is useful or where this choice makes sense. This seems to be just a bug that should be fixed. What do people think? PS: A quick fix for the current implementation is out = ufunc.reduceat(a, indices) out[:-1] *= np.diff(indices) > 0 -------------- next part -------------- An HTML attachment was scrubbed... 
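The inconsistency is easy to demonstrate with `np.add`:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# increasing indices behave as expected: sums over a[0:2], a[2:4], a[4:]
r1 = np.add.reduceat(a, [0, 2, 4])  # -> [3, 7, 5]

# but where indices[j] >= indices[j + 1], the output is simply a[indices[j]]:
# out[0] = a[2] = 3 (not an empty-slice reduction), out[1] = 2 + 3, out[2] = 4 + 5
r2 = np.add.reduceat(a, [2, 1, 3])  # -> [3, 5, 9]
```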
URL: 

From m.h.vankerkwijk at gmail.com Thu Apr 13 11:03:56 2017
From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk)
Date: Thu, 13 Apr 2017 11:03:56 -0400
Subject: [Numpy-discussion] Fixing inconsistent behaviour of reduceat()
In-Reply-To: 
References: 
Message-ID: 

Discussion is ongoing at the above issue, but perhaps worth mentioning more broadly the alternative of adding a slice argument (or start, stop, step arguments) to ufunc.reduce, which would mean we could just deprecate reduceat altogether, as most uses of it would become

    add.reduce(array, slice=slice(indices[:-1], indices[1:]))

(where now we are free to make the behaviour match what is expected for an empty slice). Here, one would broadcast the slice if it were 0-d, and could pass in tuples of slices if a tuple of axes was used.

-- Marten

From charlesr.harris at gmail.com Fri Apr 14 20:19:56 2017
From: charlesr.harris at gmail.com (Charles R Harris)
Date: Fri, 14 Apr 2017 18:19:56 -0600
Subject: [Numpy-discussion] Long term plans for dropping Python 2.7
Message-ID: 

Hi All,

It may be early to discuss dropping support for Python 2.7, but there is a disturbance in the force that suggests it might be worth looking forward to the year 2020, when Python itself will drop support for 2.7. There is also a website, http://www.python3statement.org, where several projects in the scientific python stack have pledged to be Python 2.7 free by that date. Given that, a preliminary discussion of the subject might be interesting, if only to gather information on where the community currently stands.

Chuck

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From njs at pobox.com Sat Apr 15 01:19:42 2017
From: njs at pobox.com (Nathaniel Smith)
Date: Fri, 14 Apr 2017 22:19:42 -0700
Subject: [Numpy-discussion] Long term plans for dropping Python 2.7
In-Reply-To: 
References: 
Message-ID: 

On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris wrote:
> Hi All,
>
> It may be early to discuss dropping support for Python 2.7, but there is a
> disturbance in the force that suggests that it might be worth looking
> forward to the year 2020 when Python itself will drop support for 2.7. There
> is also a website, http://www.python3statement.org, where several projects
> in the scientific python stack have pledged to be Python 2.7 free by that
> date. Given that, a preliminary discussion of the subject might be
> interesting, if only to gather information of where the community currently
> stands.

One reasonable position would be that numpy releases that happen while 2.7 is supported upstream will also support 2.7, and releases after that won't.

From numpy's perspective, I feel like the most important reason to continue supporting 2.7 is our ability to convince people to keep upgrading. (Not the only reason, but the most important.) What I mean is: if we dropped 2.7 support tomorrow, then it wouldn't actually make numpy unavailable on python 2.7; it would just mean that lots of users stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the end of the world - numpy is mature software and 1.12 works pretty well. The big problem IMO would be if this then meant that lots of downstream projects felt they had to continue supporting 1.12 going forward, which would make it very difficult for us to effectively ship new features or even bug fixes - I mean, we can ship them, but no one will use them. And if a downstream project finds a bug in numpy and can't upgrade numpy, then the tendency is to work around it instead of reporting it upstream. I think this is the main thing we want to avoid.
This kind of means that we're at the mercy of downstream projects, though - if scipy/pandas/etc. decide they want to support 2.7 until 2022, it might be in our best interest to do the same. But there's a collective action problem here: we want to keep supporting 2.7 so long as they do, but at the same time they may feel they need to keep supporting 2.7 as long as we do. And all of us would prefer to drop 2.7 support sooner rather than later, but we might all get stuck because we're waiting for someone else to move first.

So my suggestion would be that numpy make some official announcement that our plan is to drop support for python 2 immediately after cpython upstream does. If worst comes to worst we can always decide to extend it at the time... but if we make the announcement now, then it's less likely that we'll need to :-).

Another interesting project to look at here is django, since they occupy a similar place in the ecosystem (e.g. last I checked numpy and django are the two most-imported python packages on github): https://www.djangoproject.com/weblog/2015/jun/25/roadmap/

Their approach isn't directly applicable, because unlike us they have a strict time-based release schedule, a defined support period for each release, and a distinction between regular and long-term support releases, where regular releases act sort of like pre-releases-on-steroids for the next LTS release. But basically what they settled on is philosophically similar to what I'm suggesting: they don't want an LTS to be supporting 2.7 beyond when cpython is supporting it. Then on top of that they don't want to support 2.7 in the regular releases leading up to that LTS either, so the net effect is that their last release with 2.7 support came out last week and will be supported until 2020 :-). And another useful precedent, I think, is that they announced this two years ago, back in 2015; if we make an announcement now, we'll be giving a similar amount of warning.

-n

-- Nathaniel J.
Smith -- https://vorpus.org From ralf.gommers at gmail.com Sat Apr 15 01:47:31 2017 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sat, 15 Apr 2017 17:47:31 +1200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote: > On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris > wrote: > > Hi All, > > > > It may be early to discuss dropping support for Python 2.7, but there is > a > > disturbance in the force that suggests that it might be worth looking > > forward to the year 2020 when Python itself will drop support for 2.7. > There > > is also a website, http://www.python3statement.org, where several > projects > > in the scientific python stack have pledged to be Python 2.7 free by that > > date. Given that, a preliminary discussion of the subject might be > > interesting, if only to gather information of where the community > currently > > stands. > > One reasonable position would that numpy releases that happen while > 2.7 is supported upstream will also support 2.7, and releases after > that won't. > > From numpy's perspective, I feel like the most important reason to > continue supporting 2.7 is our ability to convince people to keep > upgrading. (Not the only reason, but the most important.) What I mean > is: if we dropped 2.7 support tomorrow then it wouldn't actually make > numpy unavailable on python 2.7; it would just mean that lots of users > stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the > end of the world ? numpy is mature software and 1.12 works pretty > well. The big problem IMO would be if this then meant that lots of > downstream projects felt that they had to continue supporting 1.12 > going forward, which makes it very difficult for us to effectively > ship new features or even bug fixes ? I mean, we can ship them, but > no-one will use them. 
And if a downstream project finds a bug in numpy > and can't upgrade numpy, then the tendency is to work around it > instead of reporting it upstream. I think this is the main thing we > want to avoid. > +1 > > This kind of means that we're at the mercy of downstream projects, > though ? if scipy/pandas/etc. decide they want to support 2.7 until > 2022, it might be in our best interest to do the same. But there's a > collective action problem here: we want to keep supporting 2.7 so long > as they do, but at the same time they may feel they need to keep > supporting 2.7 as long as we do. And all of us would prefer to drop > 2.7 support sooner rather than later, but we might all get stuck > because we're waiting for someone else to move first. > I don't quite agree about being stuck. These kind of upgrades should and usually do go top of stack to bottom. Something like Jupyter which is mostly an end user tool goes first (they announced 2020 quite a while ago), domain specific packages go at a similar time, then scipy & co, and only after that numpy. Cython will be even later I'm sure - it still supports Python 2.6. > > So my suggestion would be that numpy make some official announcement > that our plan is to drop support for python 2 immediately after > cpython upstream does. Not quite sure CPython schedule is relevant - important bug fixes haven't been making it into 2.7 for a very long time now, so the only change is the rare security patch. > If worst comes to worst we can always decide to > extend it at the time... but if we make the announcement now, then > it's less likely that we'll need to :-). > I'd be in favor of putting out a schedule in coordination with scipy/pandas/etc, but it probably should look more like - 2020: what's on http://www.python3statement.org/ now - 2021: scipy / pandas / scikit-learn / etc. - 2022: numpy Ralf > Another interesting project to look at here is django, since they > occupy a similar place in the ecosystem (e.g. 
last I checked numpy and > django are the two most-imported python packages on github): > https://www.djangoproject.com/weblog/2015/jun/25/roadmap/ > Their approach isn't directly applicable, because unlike us they have > a strict time-based release schedule, defined support period for each > release, and a distinction between regular and long-term support > releases, where regular releases act sort of like > pre-releases-on-steroids for the next LTS release. But basically what > they settled on is philosophically similar to what I'm suggesting: > they don't want an LTS to be supporting 2.7 beyond when cpython is > supporting it. Then on top of that they don't want to support 2.7 in > the regular releases leading up to that LTS either, so the net effect > is that their last release with 2.7 support came out last week, and it > will be supported until 2020 :-). And another useful precedent I think > is that they announced this two years ago, back in 2015; if we make an > announcement now, we'll be be giving a similar amount of warning. > > -n > > -- > Nathaniel J. Smith -- https://vorpus.org > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Apr 15 03:02:42 2017 From: njs at pobox.com (Nathaniel Smith) Date: Sat, 15 Apr 2017 00:02:42 -0700 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Fri, Apr 14, 2017 at 10:47 PM, Ralf Gommers wrote: > > > On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote: [...] >> From numpy's perspective, I feel like the most important reason to >> continue supporting 2.7 is our ability to convince people to keep >> upgrading. (Not the only reason, but the most important.) 
What I mean >> is: if we dropped 2.7 support tomorrow then it wouldn't actually make >> numpy unavailable on python 2.7; it would just mean that lots of users >> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the >> end of the world ? numpy is mature software and 1.12 works pretty >> well. The big problem IMO would be if this then meant that lots of >> downstream projects felt that they had to continue supporting 1.12 >> going forward, which makes it very difficult for us to effectively >> ship new features or even bug fixes ? I mean, we can ship them, but >> no-one will use them. And if a downstream project finds a bug in numpy >> and can't upgrade numpy, then the tendency is to work around it >> instead of reporting it upstream. I think this is the main thing we >> want to avoid. > > > +1 > >> >> >> This kind of means that we're at the mercy of downstream projects, >> though ? if scipy/pandas/etc. decide they want to support 2.7 until >> 2022, it might be in our best interest to do the same. But there's a >> collective action problem here: we want to keep supporting 2.7 so long >> as they do, but at the same time they may feel they need to keep >> supporting 2.7 as long as we do. And all of us would prefer to drop >> 2.7 support sooner rather than later, but we might all get stuck >> >> because we're waiting for someone else to move first. > > > I don't quite agree about being stuck. These kind of upgrades should and > usually do go top of stack to bottom. Something like Jupyter which is mostly > an end user tool goes first (they announced 2020 quite a while ago), domain > specific packages go at a similar time, then scipy & co, and only after that > numpy. Cython will be even later I'm sure - it still supports Python 2.6. To make sure we're on the same page about what "2020" means here: the latest release of IPython is 5.0, which came out in July last year. 
This is the last release that supports py2; they dropped support for py2 in master months ago, and 6.0 (whose schedule has been slipping, but I think should be out Any Time Now?) won't support py2. Their plan is to keep backporting bug fixes to 5.x until the end of 2017; after that the core team won't support py2 at all. And they've also announced that if volunteers want to step up to maintain 5.x after that, then they're willing to keep accepting pull requests until July 2019.

Refs:
https://blog.jupyter.org/2016/07/08/ipython-5-0-released/
https://github.com/jupyter/roadmap/blob/master/accepted/migration-to-python-3-only.md

I suspect that in practice that "end of 2017" date will be the end-of-support date for most intents and purposes. And for numpy, with its vaguely defined support periods, I think it makes most sense to talk in terms of release dates; so if we want to compare apples-to-apples, my suggestion is that numpy drops py2 support in 2020, and in that sense ipython dropped py2 support in July last year.

>>
>> So my suggestion would be that numpy make some official announcement
>> that our plan is to drop support for python 2 immediately after
>> cpython upstream does.
>
>
> Not quite sure CPython schedule is relevant - important bug fixes haven't
> been making it into 2.7 for a very long time now, so the only change is the
> rare security patch.

Huh? 2.7 gets tons of changes: https://github.com/python/cpython/commits/2.7

Officially CPython has 2 modes for releases: "regular support" and "security fixes only". 2.7 is special - it gets regular support, and then on top of that it also has a special exception to allow certain kinds of major changes, like the ssl module backports. If you know of important bug fixes that they're missing, then I think they'd like to know :-). Anyway, the reason the CPython schedule is relevant is that once they drop support, it *will* stop getting security patches, so it will become increasingly impossible to use safely.
>> >> If worst comes to worst we can always decide to >> extend it at the time... but if we make the announcement now, then >> it's less likely that we'll need to :-). > > > I'd be in favor of putting out a schedule in coordination with > scipy/pandas/etc, but it probably should look more like > - 2020: what's on http://www.python3statement.org/ now > - 2021: scipy / pandas / scikit-learn / etc. Um... pandas is already on python3statement.org right now :-) > - 2022: numpy Honestly I don't see why we should plan to support python 2 a day longer than our major downstream dependencies. That was the point of my first paragraph: for us the main benefit to supporting 2 is to avoid forcing our downstream dependencies to pin an old numpy. What's that extra year get us if they've already moved on? The other odd thing about this schedule is that you're suggesting that the organizing principle should be that the stack switches from top-of-stack to bottom... but then you left out the bottom of the stack! :-) - 2020: python -n -- Nathaniel J. Smith -- https://vorpus.org From jtaylor.debian at googlemail.com Sat Apr 15 04:47:34 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 10:47:34 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> On 15.04.2017 02:19, Charles R Harris wrote: > Hi All, > > It may be early to discuss dropping support for Python 2.7, but there is > a disturbance in the force that suggests that it might be worth looking > forward to the year 2020 when Python itself will drop support for 2.7. > There is also a website, http://www.python3statement.org > , where several projects in the > scientific python stack have pledged to be Python 2.7 free by that > date. Given that, a preliminary discussion of the subject might be > interesting, if only to gather information of where the community > currently stands. 
> > Chuck > > I am very against planning to drop it. Numpy is the lowest part of the scipy stack so it is not our decision to do so and we don't gain that much by doing so. Lets discuss this in 3 years or when the distributions kick out python2.7 (which won't happen before ~2022). There is no point doing so now. Also PyPy does not plan on dropping 2.7 by that time. Also before we even consider this we need to fix our python3 support. This means getting the IO functions (https://github.com/numpy/numpy/pull/4208) in order and adding a string type that people are less reluctant to use than the 4 byte unicode we currently offer. From perimosocordiae at gmail.com Sat Apr 15 08:49:18 2017 From: perimosocordiae at gmail.com (CJ Carey) Date: Sat, 15 Apr 2017 08:49:18 -0400 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> Message-ID: What do we think about the trade-offs of having a shared 2.7/3.x codebase going forward? As Python3 adds more nontrivial features, keeping compatibility with 2.7 becomes more burdensome. Will there be a separate py2-numpy branch/repo at some point before ending support? On Apr 15, 2017 4:48 AM, "Julian Taylor" wrote: > On 15.04.2017 02:19, Charles R Harris wrote: > > Hi All, > > > > It may be early to discuss dropping support for Python 2.7, but there is > > a disturbance in the force that suggests that it might be worth looking > > forward to the year 2020 when Python itself will drop support for 2.7. > > There is also a website, http://www.python3statement.org > > , where several projects in the > > scientific python stack have pledged to be Python 2.7 free by that > > date. Given that, a preliminary discussion of the subject might be > > interesting, if only to gather information of where the community > > currently stands. > > > > Chuck > > > > > > I am very against planning to drop it. 
> Numpy is the lowest part of the scipy stack so it is not our decision to > do so and we don't gain that much by doing so. > Lets discuss this in 3 years or when the distributions kick out > python2.7 (which won't happen before ~2022). There is no point doing so > now. > Also PyPy does not plan on dropping 2.7 by that time. > > Also before we even consider this we need to fix our python3 support. > This means getting the IO functions > (https://github.com/numpy/numpy/pull/4208) in order and adding a string > type that people are less reluctant to use than the 4 byte unicode we > currently offer. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sat Apr 15 10:17:01 2017 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sat, 15 Apr 2017 10:17:01 -0400 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> Message-ID: Hi All, I think Nathaniel had a good summary. My own 2? are mostly about the burden of supporting python2. I have only recently attempted to make changes in the C codebase of numpy and one of the reasons I found this more than a little daunting is the complex web of include files. In this respect, the python3/2 split is certainly not the biggest hindrance, but it was also not particularly helpful for understanding to have "translations" of python2 macros to python3 equivalents in npy_3kcompat.h: for newcomers, it would seem helpful if they could read the Python3 C-API and be able to understand what is going on. Of course, the above also proves Julian's point: for strings in particular, numpy still has a bit to go to be fully python3-ized. 
Finally, as for pypy: they just made a huge effort to become compatible with python3; is their plan really to stick with python2 much beyond 2020? All the best, Marten From jtaylor.debian at googlemail.com Sat Apr 15 10:30:18 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 16:30:18 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> Message-ID: <6d1b2f13-5647-84b4-d7f3-99af2b786826@googlemail.com> On 15.04.2017 16:17, Marten van Kerkwijk wrote: > Hi All, > > I think Nathaniel had a good summary. My own 2? are mostly about the > burden of supporting python2. I have only recently attempted to make > changes in the C codebase of numpy and one of the reasons I found this > more than a little daunting is the complex web of include files. In > this respect, the python3/2 split is certainly not the biggest > hindrance, but it was also not particularly helpful for understanding > to have "translations" of python2 macros to python3 equivalents in > npy_3kcompat.h: for newcomers, it would seem helpful if they could > read the Python3 C-API and be able to understand what is going on. > > Of course, the above also proves Julian's point: for strings in > particular, numpy still has a bit to go to be fully python3-ized. > > Finally, as for pypy: they just made a huge effort to become > compatible with python3; is their plan really to stick with python2 > much beyond 2020? > http://doc.pypy.org/en/latest/faq.html#how-long-will-pypy-support-python2 According to that Python2 support will be available as long as PyPy itself exists. 
From jtaylor.debian at googlemail.com Sat Apr 15 10:33:45 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 16:33:45 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: <6d1b2f13-5647-84b4-d7f3-99af2b786826@googlemail.com> References: <062c36f4-171b-3152-73c7-e96e405a753f@googlemail.com> <6d1b2f13-5647-84b4-d7f3-99af2b786826@googlemail.com> Message-ID: <0591969a-3507-efbd-8017-9361724bb22a@googlemail.com> On 15.04.2017 16:30, Julian Taylor wrote: > On 15.04.2017 16:17, Marten van Kerkwijk wrote: >> Hi All, >> >> I think Nathaniel had a good summary. My own 2? are mostly about the >> burden of supporting python2. I have only recently attempted to make >> changes in the C codebase of numpy and one of the reasons I found this >> more than a little daunting is the complex web of include files. In >> this respect, the python3/2 split is certainly not the biggest >> hindrance, but it was also not particularly helpful for understanding >> to have "translations" of python2 macros to python3 equivalents in >> npy_3kcompat.h: for newcomers, it would seem helpful if they could >> read the Python3 C-API and be able to understand what is going on. >> >> Of course, the above also proves Julian's point: for strings in >> particular, numpy still has a bit to go to be fully python3-ized. >> >> Finally, as for pypy: they just made a huge effort to become >> compatible with python3; is their plan really to stick with python2 >> much beyond 2020? >> > > http://doc.pypy.org/en/latest/faq.html#how-long-will-pypy-support-python2 > > According to that Python2 support will be available as long as PyPy > itself exists. > Of course they don't support the stdlib itself, so this doesn't actually mean much depending on how the much community will care about fixing security issues in the python2 stdlib. But at least there might be a place where patches can get accepted and released. 
From charlesr.harris at gmail.com Sat Apr 15 11:44:43 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 15 Apr 2017 09:44:43 -0600 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Fri, Apr 14, 2017 at 11:47 PM, Ralf Gommers wrote: > > > On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote: > >> On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris >> wrote: >> > Hi All, >> > >> > It may be early to discuss dropping support for Python 2.7, but there >> is a >> > disturbance in the force that suggests that it might be worth looking >> > forward to the year 2020 when Python itself will drop support for 2.7. >> There >> > is also a website, http://www.python3statement.org, where several >> projects >> > in the scientific python stack have pledged to be Python 2.7 free by >> that >> > date. Given that, a preliminary discussion of the subject might be >> > interesting, if only to gather information of where the community >> currently >> > stands. >> >> One reasonable position would that numpy releases that happen while >> 2.7 is supported upstream will also support 2.7, and releases after >> that won't. >> >> From numpy's perspective, I feel like the most important reason to >> continue supporting 2.7 is our ability to convince people to keep >> upgrading. (Not the only reason, but the most important.) What I mean >> is: if we dropped 2.7 support tomorrow then it wouldn't actually make >> numpy unavailable on python 2.7; it would just mean that lots of users >> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the >> end of the world ? numpy is mature software and 1.12 works pretty >> well. The big problem IMO would be if this then meant that lots of >> downstream projects felt that they had to continue supporting 1.12 >> going forward, which makes it very difficult for us to effectively >> ship new features or even bug fixes ? 
I mean, we can ship them, but >> no-one will use them. And if a downstream project finds a bug in numpy >> and can't upgrade numpy, then the tendency is to work around it >> instead of reporting it upstream. I think this is the main thing we >> want to avoid. >> > > +1 > > >> >> This kind of means that we're at the mercy of downstream projects, >> though ? if scipy/pandas/etc. decide they want to support 2.7 until >> 2022, it might be in our best interest to do the same. But there's a >> collective action problem here: we want to keep supporting 2.7 so long >> as they do, but at the same time they may feel they need to keep >> supporting 2.7 as long as we do. And all of us would prefer to drop >> 2.7 support sooner rather than later, but we might all get stuck >> > because we're waiting for someone else to move first. >> > > I don't quite agree about being stuck. These kind of upgrades should and > usually do go top of stack to bottom. Something like Jupyter which is > mostly an end user tool goes first (they announced 2020 quite a while ago), > domain specific packages go at a similar time, then scipy & co, and only > after that numpy. Cython will be even later I'm sure - it still supports > Python 2.6. > > >> >> So my suggestion would be that numpy make some official announcement >> that our plan is to drop support for python 2 immediately after >> cpython upstream does. > > > Not quite sure CPython schedule is relevant - important bug fixes haven't > been making it into 2.7 for a very long time now, so the only change is the > rare security patch. > > >> If worst comes to worst we can always decide to >> extend it at the time... but if we make the announcement now, then >> it's less likely that we'll need to :-). >> > > I'd be in favor of putting out a schedule in coordination with > scipy/pandas/etc, but it probably should look more like > - 2020: what's on http://www.python3statement.org/ now > - 2021: scipy / pandas / scikit-learn / etc. 
> - 2022: numpy > > I think things will move faster than one might think. In any case, we are probably about 5 releases away from 2020. As Nathaniel points out, numpy is mature and 1.12 is pretty good already, so hopefully 1.17 would be even better. I think dropping Python 2.7 support at that point would not cause much in the way of problems as 1.17 should be good for a number of years after that and would be easily installed from PyPI. A bigger driver long term might be uptake by distros, although the impact of that might be harder to estimate. I suspect it will affect developers more than end users, who will more likely be using Anaconda, Canopy, or similar to manage their development environment. Another thing to consider is that future developers will likely have less and less experience with Python 2.7 as teaching and classroom use moves to 3. Whatever we decide, I think Nathaniel's point about making an early announcement is a good one, as is Julian's comment about bringing Numpy into full support of Python 3. We need to put together a plan with at least a tentative schedule that will help get downstream projects thinking about their own plans and engender more feedback. It might be useful to have a BOF(s) at SciPy 2017 where the issue can be discussed with a broader range of people. Chuck > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Sat Apr 15 12:32:44 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 15 Apr 2017 18:32:44 +0200 Subject: [Numpy-discussion] testing needed for f2py with char/string arrays Message-ID: <7fc9930b-2711-8503-78b2-5e90ad773017@googlemail.com> hi, we need to deprecate the NPY_CHAR typenumber [0] in order to enable us to add new core dtypes without adding ugly hacks to our ABI. Technically the typenumber was deprecated way back in 1.6 when it accidentally broke our ABI. 
But due to lack of time, f2py never got updated to actually follow through. In order to unblock our dtype development cleanly, we want to finally do the deprecation properly. As nobody really knows how f2py works, and there are no existing unit tests covering the char dtype, the change is very likely to break something.

The change is available here: https://github.com/numpy/numpy/pull/8948

It attempts to map the NPY_CHAR dtype to the equivalent NPY_STRING with itemsize 1. I have only been able to come up with a test that covers one of the changed places. So if you have an f2py use case that in some way involves passing arrays of strings back and forth between python and fortran, please test that branch or post a reproducible example here.

Thanks,
Julian

[0] https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/ndarraytypes.h#L74

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 845 bytes
Desc: OpenPGP digital signature
URL: 

From ralf.gommers at gmail.com Sat Apr 15 17:20:29 2017
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Sun, 16 Apr 2017 09:20:29 +1200
Subject: [Numpy-discussion] Long term plans for dropping Python 2.7
In-Reply-To: 
References: 
Message-ID: 

On Sat, Apr 15, 2017 at 7:02 PM, Nathaniel Smith wrote:
> On Fri, Apr 14, 2017 at 10:47 PM, Ralf Gommers wrote:
> >
> > On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith wrote:
> [...]
> >> From numpy's perspective, I feel like the most important reason to
> >> continue supporting 2.7 is our ability to convince people to keep
> >> upgrading. (Not the only reason, but the most important.) What I mean
> >> is: if we dropped 2.7 support tomorrow then it wouldn't actually make
> >> numpy unavailable on python 2.7; it would just mean that lots of users
> >> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the
> >> end of the world - numpy is mature software and 1.12 works pretty
> >> well.
The big problem IMO would be if this then meant that lots of > >> downstream projects felt that they had to continue supporting 1.12 > >> going forward, which makes it very difficult for us to effectively > >> ship new features or even bug fixes -- I mean, we can ship them, but > >> no-one will use them. And if a downstream project finds a bug in numpy > >> and can't upgrade numpy, then the tendency is to work around it > >> instead of reporting it upstream. I think this is the main thing we > >> want to avoid. > > > > > > +1 > > > >> > >> > >> This kind of means that we're at the mercy of downstream projects, > >> though -- if scipy/pandas/etc. decide they want to support 2.7 until > >> 2022, it might be in our best interest to do the same. But there's a > >> collective action problem here: we want to keep supporting 2.7 so long > >> as they do, but at the same time they may feel they need to keep > >> supporting 2.7 as long as we do. And all of us would prefer to drop > >> 2.7 support sooner rather than later, but we might all get stuck > >> > >> because we're waiting for someone else to move first. > > > > > > I don't quite agree about being stuck. These kinds of upgrades should and > > usually do go top of stack to bottom. Something like Jupyter which is > mostly > > an end user tool goes first (they announced 2020 quite a while ago), > domain > > specific packages go at a similar time, then scipy & co, and only after > that > > numpy. Cython will be even later I'm sure - it still supports Python 2.6. > > To make sure we're on the same page about what "2020" means here: the > latest release of IPython is 5.0, which came out in July last year. > This is the last release that supports py2; they dropped support for > py2 in master months ago, and 6.0 (whose schedule has been slipping, > but I think should be out Any Time Now?) won't support py2. 
Their plan > is to keep backporting bug fixes to 5.x until the end of 2017; after > that the core team won't support py2 at all. And they've also > announced that if volunteers want to step up to maintain 5.x after > that, then they're willing to keep accepting pull requests until July > 2019. > > Refs: > https://blog.jupyter.org/2016/07/08/ipython-5-0-released/ > https://github.com/jupyter/roadmap/blob/master/accepted/migration-to-python-3-only.md > > I suspect that in practice that "end of 2017" date will be the > end-of-support date for most intents and purposes. And for numpy with > its vaguely defined support periods, I think it makes most sense to > talk in terms of release dates; agreed, release dates make sense, we don't want to be doing some kind of LTS scheme. > so if we want to compare > apples-to-apples, my suggestion is that numpy drops py2 support in > 2020 and in that sense ipython dropped py2 support in July last year. > > >> > >> So my suggestion would be that numpy make some official announcement > >> that our plan is to drop support for python 2 immediately after > >> cpython upstream does. > > > > > > Not quite sure CPython schedule is relevant - important bug fixes haven't > > been making it into 2.7 for a very long time now, so the only change is > the > > rare security patch. > > Huh? 2.7 gets tons of changes: https://github.com/python/cpython/commits/2.7 You're right. My experience is ending up on bugs.python.org when debugging and the answer to "can this be backported to 2.7" usually being no - but it looks like my experience is skewed by distutils, which is not exactly well maintained. > Officially CPython has 2 modes for releases: "regular support" and > "security fixes only". 2.7 is special -- it gets regular support, and > then on top of that it also has a special exception to allow certain > kinds of major changes, like the ssl module backports. 
> If you know of important bug fixes that they're missing then I think > they'd like to know :-). > Anyway, the reason the CPython schedule is relevant is that once they > drop support, it *will* stop getting security patches, so it will > become increasingly impossible to use safely. > For web stuff yes, but not all that relevant for scientific work. > > >> > >> If worst comes to worst we can always decide to > >> extend it at the time... but if we make the announcement now, then > >> it's less likely that we'll need to :-). > > > > > > I'd be in favor of putting out a schedule in coordination with > > scipy/pandas/etc, but it probably should look more like > > - 2020: what's on http://www.python3statement.org/ now > > - 2021: scipy / pandas / scikit-learn / etc. > > Um... pandas is already on python3statement.org right now :-) > > > - 2022: numpy > > Honestly I don't see why we should plan to support python 2 a day > longer than our major downstream dependencies. That was the point of > my first paragraph: for us the main benefit to supporting 2 is to > avoid forcing our downstream dependencies to pin an old numpy. What's > that extra year get us if they've already moved on? > > The other odd thing about this schedule is that you're suggesting that > the organizing principle should be that the stack switches from > top-of-stack to bottom... but then you left out the bottom of the > stack! :-) > I don't think of Python as part of the stack, because it's not upgradeable for most users (except for with conda). It's more like having a base platform (OS + compilers + Python version) on which you install a scientific stack which has numpy as its lowest level component. Ralf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From antoine at python.org Sun Apr 16 04:39:27 2017 From: antoine at python.org (Antoine Pitrou) Date: Sun, 16 Apr 2017 10:39:27 +0200 Subject: [Numpy-discussion] Long term plans for dropping Python 2.7 In-Reply-To: References: Message-ID: On Fri, 14 Apr 2017 22:19:42 -0700 Nathaniel Smith wrote: > > From numpy's perspective, I feel like the most important reason to > continue supporting 2.7 is our ability to convince people to keep > upgrading. (Not the only reason, but the most important.) What I mean > is: if we dropped 2.7 support tomorrow then it wouldn't actually make > numpy unavailable on python 2.7; it would just mean that lots of users > stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the > end of the world -- numpy is mature software and 1.12 works pretty > well. The big problem IMO would be if this then meant that lots of > downstream projects felt that they had to continue supporting 1.12 > going forward, which makes it very difficult for us to effectively > ship new features or even bug fixes -- I mean, we can ship them, but > no-one will use them. Everyone using Python 3, which is a large and growing number of people, will be able to use the new features. I think the model you've outlined above -- a kind of "LTS" Numpy version that supports 2.7 (with some amount of maintenance going on, at least to fix important bugs), and later feature releases being 3.x-only, is the right way forward. It will lighten maintenance of later versions, allow the Numpy codebase to use modern Python idioms and stdlib features, and will leave 2.x maintenance to people who really care about it. You may already have heard of it, but Django 1.11, which was just released, is the last feature release to support Python 2. Further feature releases of Django will only support Python 3. https://docs.djangoproject.com/en/1.11/releases/1.11/ Regards Antoine. 
From charlesr.harris at gmail.com Wed Apr 19 14:28:32 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 19 Apr 2017 12:28:32 -0600 Subject: [Numpy-discussion] Relaxed stride checking fixup Message-ID: Hi All, Currently numpy master has a bogus stride that will cause an error when downstream projects misuse it. That is done in order to help smoke out errors. Previously that bogus stride has been fixed up for releases, but that requires a special patch to be applied after each version branch is made. At this point I'd like to pick one or the other option and make the development and release branches the same in this regard. The question is: which option to choose? Keeping the fixup in master will remove some code and keep things simple, while not fixing up the release will possibly lead to more folks finding errors. At this point in time I am favoring applying the fixup in master. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Thu Apr 20 06:21:26 2017 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Thu, 20 Apr 2017 22:21:26 +1200 Subject: [Numpy-discussion] Relaxed stride checking fixup In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 6:28 AM, Charles R Harris wrote: > Hi All, > > Currently numpy master has a bogus stride that will cause an error when > downstream projects misuse it. That is done in order to help smoke out > errors. Previously that bogus stride has been fixed up for releases, but > that requires a special patch to be applied after each version branch is > made. At this point I'd like to pick one or the other option and make the > development and release branches the same in this regard. The question is: > which option to choose? Keeping the fixup in master will remove some code > and keep things simple, while not fixing up the release will possibly lead > to more folks finding errors. 
At this point in time I am favoring applying > the fixup in master. > > Thoughts? > If we have to pick then keeping the fixup sounds reasonable. Would there be value in making the behavior configurable at compile time? If there are more such things and they'd be behind a __NUMPY_DEBUG__ switch, then people may want to test that in their own CI. Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Apr 20 09:15:27 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 15:15:27 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays Message-ID: Hello, As you probably know numpy does not deal well with strings in Python3. The np.string type is actually zero terminated bytes and not a string. In Python2 this happened to work out as it treats bytes and strings the same way. But in Python3 this type is pretty hard to work with as each time you get an item from a numpy bytes array it needs decoding to receive a string. The only string type available in Python3 is np.unicode which uses 4-byte utf-32 encoding which is deemed to use too much memory to actually see much use. What people apparently want is a string type for Python3 which uses less memory for the common science use case which rarely needs more than latin1 encoding. As we have been told we cannot change the np.string type to actually be strings as existing programs do interpret its content as bytes despite this being very broken due to its null terminating property (it will ignore all trailing nulls). Also 8 years of working around numpy's poor python3 support decisions in third parties probably make the 'return bytes' behaviour impossible to change now. So we need a new dtype that can represent strings in numpy arrays which is smaller than the existing 4 byte utf-32. 
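(The pain points described above are easy to demonstrate; a minimal sketch, assuming only Python 3 and a current numpy:)

```python
import numpy as np

# np.unicode/str arrays store UTF-32: four bytes per character.
u = np.array(['hello'])
print(u.dtype, u.dtype.itemsize)   # '<U5', 20 bytes per element

# Bytes ("string") arrays are compact but hand back bytes, not str,
# and silently drop trailing nulls.
b = np.array([b'ab\x00\x00'], dtype='S4')
print(b[0])                        # b'ab' -- trailing nulls are gone
print(b[0].decode('latin1'))       # explicit decode needed to get a str
```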
To please everyone I think we need to go with a dtype that supports multiple encodings via metadata, similar to how datetime supports multiple units. E.g.: 'U10[latin1]' is 10 characters in latin1 encoding.

Encodings we should support are:
- latin1 (1 byte): it is compatible with ascii and adds extra characters used in the western world.
- utf-32 (4 bytes): can represent every character, equivalent with np.unicode

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes): this covers a very large range of possible characters in a reasonably compact representation
- utf-8 (4 bytes): variable length encoding with minimum size of 1 byte, but we would need to assume the worst case of 4 bytes so it would not save anything compared to utf-32 but may allow third parties to replace an encoding step with trailing null trimming on serialization.

To actually do this we have two options, both of which break our ABI when done without ugly hacks.

- Add a new dtype, e.g. npy.realstring. By not modifying an existing type, we only break programs using NPY_CHAR. The most notable case of this is f2py. It has the cosmetic disadvantage that it makes the np.unicode dtype obsolete and is more busywork to implement.
- Modify np.unicode to have encoding metadata. This allows us to reuse all the type boilerplate so it is more convenient to implement, and by extending an existing type instead of making one obsolete it results in a much nicer API. The big drawback is that it will explicitly break any third party that receives an array with a new encoding and assumes that the buffer of an array of type np.unicode will have a character itemsize of 4 bytes. To ease this problem we would need to add APIs to get the itemsize and encoding to numpy now so third parties can error out cleanly.

The implementation of it is not that big a deal, I have already created a prototype for adding latin1 metadata to np.unicode which works quite well. 
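(The datetime64 precedent already shows what bracketed per-dtype metadata looks like, and the size argument for latin1 is easy to check with Python's codecs; note the 'U10[latin1]' spelling itself is only proposed here and does not exist in numpy:)

```python
import numpy as np

# Existing precedent: datetime64 carries unit metadata in brackets,
# and dtypes with different units compare unequal.
ms = np.dtype('datetime64[ms]')
s = np.dtype('datetime64[s]')
print(ms, s, ms == s)

# The size argument for a latin1-backed dtype: one byte per character
# versus the four used by the UTF-32 np.unicode storage.
text = 'café'
print(len(text.encode('latin-1')))    # 4 bytes
print(len(text.encode('utf-32-le')))  # 16 bytes
```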
It is imo realistic to get this into 1.14 should we be able to make a decision on which way to implement it. Do you have comments on how to go forward, in particular in regards to new dtype vs modify np.unicode? cheers, Julian -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From peridot.faceted at gmail.com Thu Apr 20 12:47:09 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Thu, 20 Apr 2017 16:47:09 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor wrote: > To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datatime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > > Encodings we should support are: > - latin1 (1 bytes): > it is compatible with ascii and adds extra characters used in the > western world. > - utf-32 (4 bytes): > can represent every character, equivalent with np.unicode > > Encodings we should maybe support: > - utf-16 with explicitly disallowing surrogate pairs (2 bytes): > this covers a very large range of possible characters in a reasonably > compact representation > - utf-8 (4 bytes): > variable length encoding with minimum size of 1 bytes, but we would need > to assume the worst case of 4 bytes so it would not save anything > compared to utf-32 but may allow third parties replace an encoding step > with trailing null trimming on serialization. > I should say first that I've never used even non-Unicode string arrays, but is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand. 
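(Python already exposes the codec registry Anne refers to, with name normalization and aliases a dtype could in principle reuse; a sketch:)

```python
import codecs

# codecs.lookup() normalizes spellings and resolves aliases.
print(codecs.lookup('UTF8').name)      # 'utf-8'
print(codecs.lookup('latin-1').name)   # 'iso8859-1'

# Every registered codec provides encode/decode pairs a dtype could call.
enc = codecs.lookup('latin-1').encode
data, length = enc('numpy')
print(data, length)
```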
Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.) Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit. All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.) Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:06:31 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:06:31 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: Thanks so much for reviving this conversation -- we really do need to address this. My thoughts: What people apparently want is a string type for Python3 which uses less > memory for the common science use case which rarely needs more than > latin1 encoding. 
> Yes -- I think there is a real demand for that. To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datetime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > I wonder if we really need that -- as you say, there is real demand for a compact string type, but for many use cases, 1 byte per character is enough. So to keep things really simple, I think a single 1-byte per char encoding would meet most people's needs. What should that encoding be? latin-1 is obvious (and has the very nice property of being able to round-trip arbitrary bytes -- at least with Python's implementation) and scientific data sets tend to use the latin alphabet (with its ascii roots and all). But there is now latin-9: https://en.wikipedia.org/wiki/ISO/IEC_8859-15 Maybe a better option? Encodings we should support are: > - latin1 (1 byte): > it is compatible with ascii and adds extra characters used in the > western world. > - utf-32 (4 bytes): > can represent every character, equivalent with np.unicode > IIUC, datetime64 is, well, always 64 bits. So it may be better to have a given dtype always be the same bitwidth. So the utf-32 dtype would be a different dtype. Which also keeps it really simple: we have a latin-* dtype and a full-on unicode dtype -- that's it. Encodings we should maybe support: > - utf-16 with explicitly disallowing surrogate pairs (2 bytes): > this covers a very large range of possible characters in a reasonably > compact representation > I think UTF-16 is, very simply, the worst of both worlds. If we want a two-byte character set, then it should be UCS-2 -- i.e. explicitly rejecting any code point that takes more than two bytes to represent. (or maybe that's what you mean by explicitly disallowing surrogate pairs). 
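(For reference, the surrogate-pair issue is easy to see from Python; the `fits_ucs2` helper below is made up for illustration -- a strict UCS-2 rule is just "every code point fits in 16 bits":)

```python
# BMP characters take two bytes in UTF-16...
print(len('a'.encode('utf-16-le')))            # 2

# ...but a non-BMP character (here an emoji) needs a surrogate pair.
print(len('\U0001F600'.encode('utf-16-le')))   # 4

def fits_ucs2(s):
    """Strict UCS-2 check: every code point must fit in 16 bits."""
    return all(ord(ch) <= 0xFFFF for ch in s)

print(fits_ucs2('naïve'), fits_ucs2('\U0001F600'))  # True False
```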
in any case, it should certainly give you an encoding error if you try to pass in a unicode character that cannot fit into two bytes. So: is there actually a demand for this? If so, then I think it should be a separate 2-byte string type, with the encoding always the same. > - utf-8 (4 bytes): > variable length encoding with minimum size of 1 byte, but we would need > to assume the worst case of 4 bytes so it would not save anything > compared to utf-32 but may allow third parties to replace an encoding step > with trailing null trimming on serialization. > yeach -- utf-8 is great for interchange and streaming data, but not for internal storage, particularly with numpy's every-item-has-the-same-number-of-bytes requirement. So if someone wants to work with utf-8 they can store it in a byte array, and encode and decode as they pass it to/from python. That's going to have to happen anyway, even if under the hood. And it's risky business -- if you truncate a utf-8 bytestring, you may get invalid data -- it really does not belong in numpy. > - Add a new dtype, e.g. npy.realstring > I think that's the way to go. Backwards compatibility is really key. Though could we make the existing string dtype a latin-1 always type without breaking too much? Or maybe deprecate and get there in the future? It has the cosmetic disadvantage that it makes the np.unicode dtype > obsolete and is more busywork to implement. > I think the np.unicode type should remain as the 4-bytes per char encoding. But that only makes sense if you follow my idea that we don't have a variable number of bytes per char dtype. So my proposal is:
- Create a new one-byte-per-char dtype that is always latin-9 encoded.
  - in python3 it would map to a string (i.e. unicode)
- Keep the 4-byte per char unicode string type
Optionally (if there is really demand):
- Create a new two-byte per char dtype that is always UCS-2 encoded.
Is there any way to leverage Python3's nifty string type? I'm thinking not. 
At least not for numpy arrays that can play well with C code, etc. All that being said, an encoding-specified string dtype would be nice too -- I just think it's more complex than it needs to be. Numpy is not the tool for text processing... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Apr 20 13:26:13 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 10:26:13 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing. On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Variable-length encodings, of which UTF-8 is obviously the one that makes > good handling essential, are indeed more complicated. But is it strictly > necessary that string arrays hold fixed-length *strings*, or can the > encoding length be fixed instead? That is, currently if you try to assign a > longer string than will fit, the string is truncated to the number of > characters in the data type. Instead, for encoded Unicode, the string could > be truncated so that the encoding fits. Of course this is not completely > trivial for variable-length encodings, but it should be doable, and it > would allow UTF-8 to be used just the way it usually is - as an encoding > that's almost 8-bit. > I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. 
Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters. In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype. The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:28:14 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:28:14 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Is there any reason not to support all Unicode encodings that python does, > with the same names and semantics? This would surely be the simplest to > understand. > I think it should support all fixed-length encodings, but not the non-fixed length ones -- they just don't fit well into the numpy data model. > Also, if latin1 is to going to be the only practical 8-bit encoding, maybe > check with some non-Western users to make sure it's not going to wreck > their lives? I'd have selected ASCII as an encoding to treat specially, if > any, because Unicode already does that and the consequences are familiar. > (I'm used to writing and reading French without accents because it's passed > through ASCII, for example.) > latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for the European language and some symbols that are nice to have (I use the degree symbol a lot...). 
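(The practical latin-1 vs latin-9 difference is concrete and easy to check with Python's codecs -- latin-9 is registered as 'iso-8859-15'; a sketch:)

```python
# The degree symbol is covered by latin-1 (and latin-9).
print('°'.encode('latin-1'))        # b'\xb0'

# The euro sign only exists in latin-9 (ISO 8859-15), at 0xA4.
print('€'.encode('iso-8859-15'))    # b'\xa4'
try:
    '€'.encode('latin-1')
except UnicodeEncodeError:
    print('not representable in latin-1')
```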
And it is ASCII compatible -- so there is NO reason to choose ASCII over Latin-* Which does no good for non-latin languages -- so we need to hear from the community -- is there a substantial demand for a non-latin one-byte per character encoding? > Variable-length encodings, of which UTF-8 is obviously the one that makes > good handling essential, are indeed more complicated. But is it strictly > necessary that string arrays hold fixed-length *strings*, or can the > encoding length be fixed instead? That is, currently if you try to assign a > longer string than will fit, the string is truncated to the number of > characters in the data type. > we could do that, yes, but an improperly truncated "string" becomes invalid -- just seems like a recipe for bugs that won't be found in testing. Memory is cheap, compression is fast -- we really shouldn't get hung up on this! Note: if you are storing a LOT of text (for which I have no idea why you would use numpy anyway), then the memory size might matter, but then semi-arbitrary truncation would probably matter, too. I expect most text storage in numpy arrays is things like names of datasets, ids, etc, etc -- not massive amounts of text -- so storage space really isn't critical, but having an id or something unexpectedly truncated could be bad. I think practical experience has shown us that people do not handle "mostly fixed length but once in awhile not" text well -- see the nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use utf-8 if all your data are in ascii???). but still -- why invite hard-to-test-for errors? Final point -- as Julian suggests, one reason to support utf-8 is for interoperability with other systems -- but that makes errors more of an issue -- if it doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array. 
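(The invalid-data failure mode is a one-liner to reproduce; a minimal sketch:)

```python
raw = 'café'.encode('utf-8')   # 5 bytes: the 'é' takes two
clipped = raw[:4]              # byte-level truncation splits the 'é'

try:
    clipped.decode('utf-8')
except UnicodeDecodeError:
    print('truncated mid-character: no longer valid UTF-8')
```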
-CHB it would allow UTF-8 to be used just the way it usually is - as an > encoding that's almost 8-bit. > ouch! that perception is the route to way too many errors! it is by no means almost 8-bit, unless your data are almost ascii -- in which case, use latin-1 for pity's sake! This highlights my point though -- if we support UTF-8, people WILL use it, and only test it with mostly-ascii text, and not find the bugs that will crop up later. All this said, it seems to me that the important use cases for string > arrays involve interaction with existing binary formats, so people who have > to deal with such data should have the final say. (My own closest approach > to this is the FITS format, which is restricted by the standard to ASCII.) > yup -- not sure we'll get much guidance here though -- netcdf does not solve this problem well, either. But if you are pulling, say, a utf-8 encoded string out of a netcdf file -- it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than pasting the bytes straight to a numpy array and hoping that the encoding and truncation are correct. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Apr 20 13:36:42 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 20 Apr 2017 17:36:42 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included? 
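(For the record, a byte-budget truncation that never splits a multi-byte character is a short exercise in plain Python; this `truncate_utf8` helper is purely illustrative, not anything numpy ships:)

```python
def truncate_utf8(s, max_bytes):
    """Encode s to UTF-8, keeping only whole characters within max_bytes."""
    raw = s.encode('utf-8')
    if len(raw) <= max_bytes:
        return raw
    # Back up past any continuation bytes (0b10xxxxxx) so the cut
    # falls on a character boundary.
    cut = max_bytes
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut]

out = truncate_utf8('café', 4)
print(out, out.decode('utf-8'))   # b'caf' 'caf'
```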
On Thu, Apr 20, 2017 at 1:32 PM Chris Barker wrote: > [snip -- full quote of Chris's message above] -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:43:18 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:43:18 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a > fixed size per array element, but that doesn't mean we need a fixed size > per character. Each element in a UTF-8 array would be a string with a fixed > number of codepoints, not characters. > Ah, yes -- the nightmare of Unicode! No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units") and an unknown number of characters. As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. 
Which would be pretty pointless -- better to use UCS-4. So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct). But how much to over allocate? for english text, with an occasional scientific symbol, only a little. for, say, Japanese text, you'd need a factor 2 maybe? Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if: you need fixed-length storage you care about compactness In fact, we already have this sort of distinction between element size and > memory usage: np.string_ uses null padding to store shorter strings in a > larger dtype. > sure -- but it is clear to the user that the dtype can hold "up to this many" characters. > The only reason I see for supporting encodings other than UTF-8 is for > memory-mapping arrays stored with those encodings, but that seems like a > lot of extra trouble for little gain. > I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-) On the other hand, if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options. I think a 1-byte-per char latin-* encoded string is a good idea though -- scientific use tend to be latin only and space constrained. All that being said, if the truncation code were carefully written, it would mostly "just work" -CHB -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 20 13:46:31 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 20 Apr 2017 10:46:31 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 10:36 AM, Neal Becker wrote: > I'm no unicode expert, but can't we truncate unicode strings so that only > valid characters are included? > sure -- it's just a bit fiddly -- and you need to make sure that everything gets passed through the proper mechanism. numpy is all about folks using other code to mess with the bytes in a numpy array. so we can't expect that all numpy string arrays will have been created with numpy code. Does python's string have a truncated encode option? i.e. you don't want to encode to utf-8 and then just chop it off. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Thu Apr 20 13:58:26 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Thu, 20 Apr 2017 17:58:26 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: > if you truncate a utf-8 bytestring, you may get invalid data Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF8. Also, is silent truncation a thing that we want to allow to happen anyway?
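(An aside on the truncated-encode question above: Python's str has no built-in truncating encode, but one can be sketched in a few lines. This is purely illustrative -- `truncate_utf8` is an invented helper, not a stdlib or numpy API:)

```python
def truncate_utf8(s, max_bytes):
    """Encode s as UTF-8 using at most max_bytes, never splitting a
    multi-byte sequence (so the result is always valid UTF-8)."""
    raw = s.encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    # Chop at the byte limit, then let the decoder drop the (at most
    # one) incomplete trailing sequence. Since raw came from a valid
    # encode, errors="ignore" can only discard that partial tail.
    return raw[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")

print(truncate_utf8("héllo", 3))  # b'h\xc3\xa9' -- keeps the whole 'é'
print(truncate_utf8("héllo", 2))  # b'h' -- rather than the invalid b'h\xc3'
```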
That sounds like something the user ought to be alerted to with an exception. > if you wanted to specify that a numpy element would be able to hold, say, N characters > ... > It simply is not the right way to handle text if [...] you need fixed-length storage It seems to me that counting code points is pretty futile in unicode, due to combining characters. The only two meaningful things to count are: * Graphemes, as that's what the user sees visually. These can span multiple code-points * Bytes of encoded data, as that's the space needed to store them So I would argue that the approach of fixed-codepoint-length storage is itself a flawed design, and so should not be used as a constraint on numpy. Counting graphemes is hard, so that leaves the only sensible option as a byte count. I don't foresee variable-length encodings being a problem implementation-wise - they only become one if numpy were to acquire a vectorized substring function that is intended to return a view. I think I'd be in favor of supporting all encodings, and falling back on python to handle encoding/decoding them. On Thu, 20 Apr 2017 at 18:44 Chris Barker wrote: > On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > >> I agree with Anne here. Variable-length encoding would be great to have, >> but even fixed length UTF-8 (in terms of memory usage, not characters) >> would solve NumPy's Python 3 string problem. NumPy's memory model needs a >> fixed size per array element, but that doesn't mean we need a fixed sized >> per character. Each element in a UTF-8 array would be a string with a fixed >> number of codepoints, not characters. >> > > Ah, yes -- the nightmare of Unicode! > > No, it would not be a fixed number of codepoints -- it would be a fixed > number of bytes (or "code units"). and an unknown number of characters.
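(The combining-character point is easy to demonstrate with the standard library -- 'é' is a single grapheme but can be one or two code points, and two or three UTF-8 bytes:)

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "e\u0301")  # 'é' as one code point
nfd = unicodedata.normalize("NFD", "\u00e9")   # 'e' + combining acute accent

print(len(nfc), len(nfd))  # 1 2 -- same grapheme, different code point counts
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 2 3
print(nfd[:1])  # 'e' -- slicing by code points silently drops the accent
```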
> > As Julian pointed out, if you wanted to specify that a numpy element would > be able to hold, say, N characters (actually code points, combining > characters make this even more confusing) then you would need to allocate > N*4 bytes to make sure you could hold any string that long. Which would be > pretty pointless -- better to use UCS-4. > > So Anne's suggestion that numpy truncates as needed would make sense -- > you'd specify say N characters, numpy would arbitrarily (or user specified) > over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a > string that didn't fit. Then you'd need to make sure you truncated > correctly, so as not to create an invalid string (that's just code, it > could be made correct). > > But how much to over allocate? for english text, with an occasional > scientific symbol, only a little. for, say, Japanese text, you'd need a > factor 2 maybe? > > Anyway, the idea that "just use utf-8" solves your problems is really > dangerous. It simply is not the right way to handle text if: > > you need fixed-length storage > you care about compactness > > In fact, we already have this sort of distinction between element size and >> memory usage: np.string_ uses null padding to store shorter strings in a >> larger dtype. >> > > sure -- but it is clear to the user that the dtype can hold "up to this > many" characters. > > >> The only reason I see for supporting encodings other than UTF-8 is for >> memory-mapping arrays stored with those encodings, but that seems like a >> lot of extra trouble for little gain. >> > > I see it the other way around -- the only reason TO support utf-8 is for > memory mapping with other systems that use it :-) > > On the other hand, if we ARE going to support utf-8 -- maybe use it for > all unicode support, rather than messing around with all the multiple > encoding options. 
> > I think a 1-byte-per char latin-* encoded string is a good idea though -- > scientific use tend to be latin only and space constrained. > > All that being said, if the truncation code were carefully written, it > would mostly "just work" > > -CHB > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Apr 20 14:15:49 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 20:15:49 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. On 20.04.2017 15:15, Julian Taylor wrote: > Hello, > As you probably know numpy does not deal well with strings in Python3. > The np.string type is actually zero terminated bytes and not a string. > In Python2 this happened to work out as it treats bytes and strings the > same way. 
But in Python3 this type is pretty hard to work with as each > time you get an item from a numpy bytes array it needs decoding to > receive a string. > The only string type available in Python3 is np.unicode which uses > 4-byte utf-32 encoding which is deemed to use too much memory to > actually see much use. > > What people apparently want is a string type for Python3 which uses less > memory for the common science use case which rarely needs more than > latin1 encoding. > As we have been told we cannot change the np.string type to actually be > strings as existing programs do interpret its content as bytes despite > this being very broken due to its null terminating property (it will > ignore all trailing nulls). > Also 8 years of working around numpy's poor python3 support decisions in > third parties probably make the 'return bytes' behaviour impossible to > change now. > > So we need a new dtype that can represent strings in numpy arrays which > is smaller than the existing 4 byte utf-32. > > To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datatime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > > Encodings we should support are: > - latin1 (1 bytes): > it is compatible with ascii and adds extra characters used in the > western world. > - utf-32 (4 bytes): > can represent every character, equivalent with np.unicode > > Encodings we should maybe support: > - utf-16 with explicitly disallowing surrogate pairs (2 bytes): > this covers a very large range of possible characters in a reasonably > compact representation > - utf-8 (4 bytes): > variable length encoding with minimum size of 1 bytes, but we would need > to assume the worst case of 4 bytes so it would not save anything > compared to utf-32 but may allow third parties replace an encoding step > with trailing null trimming on serialization. 
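(For reference, the 4-bytes-per-character cost of the existing unicode dtype, and the bytes-not-str behavior of the existing string dtype, are both visible directly with today's numpy -- this only demonstrates current behavior, not anything proposed:)

```python
import numpy as np

u = np.array(["spam", "egg"], dtype="U4")    # UTF-32: 4 bytes per character
s = np.array([b"spam", b"egg"], dtype="S4")  # null-padded bytes

print(u.dtype.itemsize)  # 16 -- 4 characters * 4 bytes
print(s.dtype.itemsize)  # 4
print(s[0])              # b'spam' -- bytes on Python 3, needs .decode()
```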
> > To actually do this we have two options, both of which break our ABI when > done without ugly hacks. > > - Add a new dtype, e.g. npy.realstring > By not modifying an existing type we only break programs using > NPY_CHAR. The most notable case of this is f2py. > It has the cosmetic disadvantage that it makes the np.unicode dtype > obsolete and is more busywork to implement. > > - Modify np.unicode to have encoding metadata > This allows us to reuse all the type boilerplate so it is more > convenient to implement and by extending an existing type instead of > making one obsolete it results in a much nicer API. > The big drawback is that it will explicitly break any third party that > receives an array with a new encoding and assumes that the buffer of an > array of type np.unicode will have a character itemsize of 4 bytes. > To ease this problem we would need to add APIs to get the itemsize and > encoding to numpy now so third parties can error out cleanly. > > The implementation of it is not that big a deal, I have already created > a prototype for adding latin1 metadata to np.unicode which works quite > well. It is imo realistic to get this into 1.14 should we be able to > make a decision on which way to implement it. > > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? > > cheers, > Julian > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From shoyer at gmail.com Thu Apr 20 14:16:34 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 11:16:34 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker wrote: > On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > >> I agree with Anne here.
Variable-length encoding would be great to have, >> but even fixed length UTF-8 (in terms of memory usage, not characters) >> would solve NumPy's Python 3 string problem. NumPy's memory model needs a >> fixed size per array element, but that doesn't mean we need a fixed sized >> per character. Each element in a UTF-8 array would be a string with a fixed >> number of codepoints, not characters. >> > > Ah, yes -- the nightmare of Unicode! > > No, it would not be a fixed number of codepoints -- it would be a fixed > number of bytes (or "code units"). and an unknown number of characters. > Apologies for confusing the terminology! Yes, this would mean a fixed number of bytes and an unknown number of characters. > As Julian pointed out, if you wanted to specify that a numpy element would > be able to hold, say, N characters (actually code points, combining > characters make this even more confusing) then you would need to allocate > N*4 bytes to make sure you could hold any string that long. Which would be > pretty pointless -- better to use UCS-4. > It's already unsafe to try to insert arbitrary length strings into a numpy string_ or unicode_ array. When determining the dtype automatically (e.g., with np.array(list_of_strings)), the difference is that numpy would need to check the maximum encoded length instead of the character length (i.e., len(x.encode()) instead of len(x)). I certainly would not over-allocate. If users want more space, they can explicitly choose an appropriate size. (This is a hazard of not having variable-length dtypes.) If users really want to be able to fit an arbitrary number of unicode characters and aren't concerned about memory usage, they can still use np.unicode_ -- that won't be going away. > So Anne's suggestion that numpy truncates as needed would make sense -- > you'd specify say N characters, numpy would arbitrarily (or user specified) > over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a > string that didn't fit.
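(The character-length versus encoded-length distinction, concretely:)

```python
s = "naïve"  # 5 characters
print(len(s))                      # 5
print(len(s.encode("utf-8")))      # 6 -- 'ï' needs two bytes
print(len(s.encode("latin-1")))    # 5
print(len(s.encode("utf-32-le")))  # 20 -- what np.unicode_ stores today
```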
Then you'd need to make sure you truncated > correctly, so as not to create an invalid string (that's just code, it > could be made correct). > NumPy already does this sort of silent truncation with longer strings inserted into shorter string dtypes. The difference here would indeed be the need to check the number of bytes represented by the string instead of the number of characters. But I don't think this is useful behavior to bring over to a new dtype. We should error instead of silently truncating. This is certainly easier than trying to figure out when we would be splitting a character. > But how much to over allocate? for english text, with an occasional > scientific symbol, only a little. for, say, Japanese text, you'd need a > factor 2 maybe? > > Anyway, the idea that "just use utf-8" solves your problems is really > dangerous. It simply is not the right way to handle text if: > > you need fixed-length storage > you care about compactness > > In fact, we already have this sort of distinction between element size and >> memory usage: np.string_ uses null padding to store shorter strings in a >> larger dtype. >> > > sure -- but it is clear to the user that the dtype can hold "up to this > many" characters. > As Yu Feng points out in this GitHub comment, non-latin language speakers are already aware of the difference between string length and bytes length: https://github.com/numpy/numpy/pull/8942#issuecomment-294409192 Making an API based on code units instead of code points really seems like the saner way to handle unicode strings. I agree with this section of the DyND design docs for its string type, which notes precedent from Julia and Go: https://github.com/libdynd/libdynd/blob/master/devdocs/string-design.md#code-unit-api-not-code-point I think a 1-byte-per char latin-* encoded string is a good idea though -- > scientific use tend to be latin only and space constrained.
I think scientific users tend to be ASCII only, so UTF-8 would also work transparently :). -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Apr 20 14:22:52 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 20 Apr 2017 12:22:52 -0600 Subject: [Numpy-discussion] Relaxed stride checking fixup In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 4:21 AM, Ralf Gommers wrote: > > > On Thu, Apr 20, 2017 at 6:28 AM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > >> Hi All, >> >> Currently numpy master has a bogus stride that will cause an error when >> downstream projects misuse it. That is done in order to help smoke out >> errors. Previously that bogus stride has been fixed up for releases, but >> that requires a special patch to be applied after each version branch is >> made. At this point I'd like to pick one or the other option and make the >> development and release branches the same in this regard. The question is: >> which option to choose? Keeping the fixup in master will remove some code >> and keep things simple, while not fixing up the release will possibly lead >> to more folks finding errors. At this point in time I am favoring applying >> the fixup in master. >> >> Thoughts? >> > > If we have to pick then keeping the fixup sounds reasonable. Would there > be value in making the behavior configurable at compile time? If there are > more such things and they'd be behind a __NUMPY_DEBUG__ switch, then people > may want to test that in their own CI. > Interesting thought. I wonder what else might be a good candidate for such a switch? Chuck -------------- next part -------------- An HTML attachment was scrubbed...
URL: From antoine at python.org Thu Apr 20 14:23:11 2017 From: antoine at python.org (Antoine Pitrou) Date: Thu, 20 Apr 2017 20:23:11 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, 20 Apr 2017 10:26:13 -0700 Stephan Hoyer wrote: > > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a > fixed size per array element, but that doesn't mean we need a fixed sized > per character. Each element in a UTF-8 array would be a string with a fixed > number of codepoints, not characters. > > In fact, we already have this sort of distinction between element size and > memory usage: np.string_ uses null padding to store shorter strings in a > larger dtype. > > The only reason I see for supporting encodings other than UTF-8 is for > memory-mapping arrays stored with those encodings, but that seems like a > lot of extra trouble for little gain. I think you want at least: ascii, utf8, ucs2 (aka utf16 without surrogates), utf32. That is, 3 common fixed width encodings and one variable width encoding. Regards Antoine. From robert.kern at gmail.com Thu Apr 20 14:53:53 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 11:53:53 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. 
We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases. FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things). So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions. For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string. I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R. 
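(The dtype=object workflow described above, for concreteness -- the example strings here are invented, and the manual encode at the end is just one way to get a serializable fixed-width array out of an object array:)

```python
import numpy as np

names = np.array(["Dvořák", "Saint-Saëns"], dtype=object)

# Each element is an ordinary Python str: full unicode semantics, no
# fixed length, at the cost of one heap object (and pointer) per item.
names[0] = names[0] + " (1841-1904)"
print(names[0])

# The array itself holds only pointers, so writing it to disk needs
# pickle -- or an explicit encode into a fixed-width bytes array:
encoded = np.array([x.encode("utf-8") for x in names])
print(encoded.dtype)  # a fixed-width 'S' dtype sized to the longest element
```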
If I had to jump ahead and propose new dtypes, I might suggest this: * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs. * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name. * Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied. * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Apr 20 15:05:11 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 12:05:11 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: > I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. 
HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions: https://github.com/PyTables/PyTables/issues/499 https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Thu Apr 20 14:59:44 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Thu, 20 Apr 2017 18:59:44 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor wrote: > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > This allows any codec (including truncated utf8) to be added easily (if > python supports it) and allows sidestepping the debate. > > My main concern is whether it should be a new dtype or modifying the > unicode dtype. Though the backward compatibility argument is strongly in > favour of adding a new dtype that makes the np.unicode type redundant. > Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively maintain/deprecate the old one.) Anne -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wieser.eric+numpy at gmail.com Thu Apr 20 15:15:33 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Thu, 20 Apr 2017 19:15:33 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary? Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code carelessly dropping it? Is this a problem in both C and python, or just C? If that's the case, can we end up with a compromise where being careless just causes old code to promote to ucs32? On Thu, 20 Apr 2017 at 20:09 Anne Archibald wrote: > On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > >> I probably have formulated my goal with the proposal a bit better, I am >> not very interested in a repetition of which encoding to use debate. >> In the end what will be done allows any encoding via a dtype with >> metadata like datetime. >> This allows any codec (including truncated utf8) to be added easily (if >> python supports it) and allows sidestepping the debate. >> >> My main concern is whether it should be a new dtype or modifying the >> unicode dtype. Though the backward compatibility argument is strongly in >> favour of adding a new dtype that makes the np.unicode type redundant. >> > > Creating a new dtype to handle encoded unicode, with the encoding > specified in the dtype, sounds perfectly reasonable to me. Changing the > behaviour of the existing unicode dtype seems like it's going to lead to > massive headaches unless exactly nobody uses it. The only downside to a new > type is having to find an obvious name that isn't already in use. (And > having to actively maintain/deprecate the old one.) 
> > Anne > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Thu Apr 20 15:17:39 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Thu, 20 Apr 2017 19:17:39 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explicitly? I feel like we ended up with the > current sub-optimal situation because we never really laid out the use > cases. We just felt like we needed bytestring and unicode dtypes, more out > of completionism than anything, and we made a bunch of assumptions just to > get each one done. I think there may be broad agreement that many of those > assumptions are "wrong", but it would be good to reference that against > concretely-stated use cases. > +1 > FWIW, if I need to work with in-memory arrays of strings in Python code, > I'm going to use dtype=object a la pandas. It has almost no arbitrary > constraints, and I can rely on Python's unicode facilities freely. There > may be some cases where it's a little less memory-efficient (e.g. > representing a column of enumerated single-character values like 'M'/'F'), > but that's never prevented me from doing anything (compare to the > uniform-length restrictions, which *have* prevented me from doing things). > > So what's left? Being able to memory-map to files that have string data > conveniently laid out according to numpy assumptions (e.g. FITS). 
Being > able to work with C/C++/Fortran APIs that have arrays of strings laid out > according to numpy assumptions (e.g. HDF5). I think it would behoove us to > canvass the needs of these formats and APIs before making any more > assumptions. > > For example, to my understanding, FITS files more or less follow numpy > assumptions for its string columns (i.e. uniform-length). But it enforces > 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the > singular motivating use case for the trailing-NULL behavior of np.string. > Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. [...] > If I had to jump ahead and propose new dtypes, I might suggest this: > > * For the most part, treat the string dtypes as temporary communication > formats rather than the preferred in-memory working format, similar to how > we use `float16` to communicate with GPU APIs. > > * Acknowledge the use cases of the current NULL-terminated np.string > dtype, but perhaps add a new canonical alias, document it as being for > those specific use cases, and deprecate/de-emphasize the current name. > > * Add a dtype for holding uniform-length `bytes` strings. This would be > similar to the current `void` dtype, but work more transparently with the > `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` > like `float64` does with `float`. This would not be NULL-terminated. No > encoding would be implied. > How would this differ from a numpy array of bytes with one more dimension? > * Maybe add a dtype similar to `object_` that only permits `unicode/str` > (2.x/3.x) strings (and maybe None to represent missing data a la pandas). > This maintains all of the flexibility of using a `dtype=object` array while > allowing code to specialize for working with strings without all kinds of > checking on every item. 
But most importantly, we can serialize such an > array to bytes without having to use pickle. Utility functions could be > written for en-/decoding to/from the uniform-length bytestring arrays > handling different encodings and things like NULL-termination (also working > with the legacy dtypes and handling structured arrays easily, etc.). > I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Apr 20 15:17:48 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 12:17:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. > > > HDF5 supports two character sets, ASCII and UTF-8. 
Both come in fixed and variable length versions: > https://github.com/PyTables/PyTables/issues/499 > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Thu Apr 20 15:24:35 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 20 Apr 2017 13:24:35 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.debian at googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explicitly? I feel like we ended up with the > current sub-optimal situation because we never really laid out the use > cases. We just felt like we needed bytestring and unicode dtypes, more out > of completionism than anything, and we made a bunch of assumptions just to > get each one done. I think there may be broad agreement that many of those > assumptions are "wrong", but it would be good to reference that against > concretely-stated use cases. > > FWIW, if I need to work with in-memory arrays of strings in Python code, > I'm going to use dtype=object a la pandas. It has almost no arbitrary > constraints, and I can rely on Python's unicode facilities freely. There > may be some cases where it's a little less memory-efficient (e.g. 
> representing a column of enumerated single-character values like 'M'/'F'), > but that's never prevented me from doing anything (compare to the > uniform-length restrictions, which *have* prevented me from doing things). > > So what's left? Being able to memory-map to files that have string data > conveniently laid out according to numpy assumptions (e.g. FITS). Being > able to work with C/C++/Fortran APIs that have arrays of strings laid out > according to numpy assumptions (e.g. HDF5). I think it would behoove us to > canvass the needs of these formats and APIs before making any more > assumptions. > > For example, to my understanding, FITS files more or less follow numpy > assumptions for its string columns (i.e. uniform-length). But it enforces > 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the > singular motivating use case for the trailing-NULL behavior of np.string. > > I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. > > We should look at some of the newer formats and APIs, like Parquet and > Arrow, and also consider the cross-language APIs with Julia and R. > > If I had to jump ahead and propose new dtypes, I might suggest this: > > * For the most part, treat the string dtypes as temporary communication > formats rather than the preferred in-memory working format, similar to how > we use `float16` to communicate with GPU APIs. > > * Acknowledge the use cases of the current NULL-terminated np.string > dtype, but perhaps add a new canonical alias, document it as being for > those specific use cases, and deprecate/de-emphasize the current name. > > * Add a dtype for holding uniform-length `bytes` strings. 
This would be > similar to the current `void` dtype, but work more transparently with the > `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` > like `float64` does with `float`. This would not be NULL-terminated. No > encoding would be implied. > > * Maybe add a dtype similar to `object_` that only permits `unicode/str` > (2.x/3.x) strings (and maybe None to represent missing data a la pandas). > This maintains all of the flexibility of using a `dtype=object` array while > allowing code to specialize for working with strings without all kinds of > checking on every item. But most importantly, we can serialize such an > array to bytes without having to use pickle. Utility functions could be > written for en-/decoding to/from the uniform-length bytestring arrays > handling different encodings and things like NULL-termination (also working > with the legacy dtypes and handling structured arrays easily, etc.). > > A little history, IIRC, storing null terminated strings in fixed byte lengths was done in Fortran, strings were usually stored in integers/integer_arrays. If memory mapping of arbitrary types is not important, I'd settle for ascii or latin-1, utf-8 fixed byte length, and arrays of fixed python object type. Using one byte encodings and utf-8 avoids needing to deal with endianess. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Thu Apr 20 15:27:17 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 21:27:17 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: <659e2b27-b952-4db7-e9b1-9364681f8aa8@googlemail.com> On 20.04.2017 20:53, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > > wrote: > >> Do you have comments on how to go forward, in particular in regards to >> new dtype vs modify np.unicode? 
> > Can we restate the use cases explicitly? I feel like we ended up with > the current sub-optimal situation because we never really laid out the > use cases. We just felt like we needed bytestring and unicode dtypes, > more out of completionism than anything, and we made a bunch of > assumptions just to get each one done. I think there may be broad > agreement that many of those assumptions are "wrong", but it would be > good to reference that against concretely-stated use cases. We ended up in this situation because we did not take the opportunity to break compatibility when python3 support was added. We should have made the string dtype an encoded byte type (ascii or latin1) in python3 instead of null terminated unencoded bytes which do not make very much practical sense. So the use case is very simple: Give users of the string dtype a migration path that does not involve converting to full utf32 unicode. The latin1 encoded bytes dtype would allow that. As we already have the infrastructure this same dtype can allow more than just latin1 with minimal effort, for the fixed size python supported stuff it is literally adding an enum entry, two new switch clauses and a little bit of dtype string parsing and testcases. Having some form of variable string handling would be nice. But this is another topic all together. Having builtin support for variable strings only seems overkill as the string dtype is not that important and object arrays should work reasonably well for this usecase already. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From rainwoodman at gmail.com Thu Apr 20 15:34:24 2017 From: rainwoodman at gmail.com (Feng Yu) Date: Thu, 20 Apr 2017 12:34:24 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: I suggest a new data type 'text[encoding]', 'T'. 1. 
text can be cast to python strings via decoding. 2. Conceptually casting to python bytes first cast to a string then calls encode(); the current encoding in the meta data is used by default, but the new encoding can be overridden. I slightly favour 'T16' as a fixed size, text record backed by 16 bytes. This way over-allocation is forcefully delegated to the user, simplifying numpy array. Yu On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> >> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern >> wrote: >>> >>> I don't know of a format off-hand that works with numpy uniform-length >>> strings and Unicode as well. HDF5 (to my recollection) supports arrays of >>> NULL-terminated, uniform-length ASCII like FITS, but only variable-length >>> UTF8 strings. >> >> >> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and >> variable length versions: >> https://github.com/PyTables/PyTables/issues/499 >> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html >> >> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for >> storage, not the number of characters. > > Ah, okay, I was interpolating from a quick perusal of the h5py docs, which > of course are also constrained by numpy's current set of dtypes. The > NULL-terminated ASCII works well enough with np.string's semantics. 
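[Editorial note] The point that HDF5's "fixed length UTF-8" counts bytes rather than characters is easy to demonstrate — the same character count can require different byte counts:

```python
# Two 4-character strings with different UTF-8 byte lengths.
s_ascii = "cafe"
s_accent = "café"

assert len(s_ascii) == len(s_accent) == 4        # 4 code points each
assert len(s_ascii.encode("utf-8")) == 4         # 4 bytes
assert len(s_accent.encode("utf-8")) == 5        # é takes 2 bytes

# So a fixed 5-byte UTF-8 field holds "café" but not "caffé",
# even though the latter is only 5 characters long:
assert len("caffé".encode("utf-8")) == 6
```

This is why a byte-counted UTF-8 dtype cannot promise a fixed number of characters, only a fixed storage size.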
> > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From jtaylor.debian at googlemail.com Thu Apr 20 15:40:12 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Thu, 20 Apr 2017 21:40:12 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On 20.04.2017 20:59, Anne Archibald wrote: > On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor > > > wrote: > > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > This allows any codec (including truncated utf8) to be added easily (if > python supports it) and allows sidestepping the debate. > > My main concern is whether it should be a new dtype or modifying the > unicode dtype. Though the backward compatibility argument is strongly in > favour of adding a new dtype that makes the np.unicode type redundant. > > > Creating a new dtype to handle encoded unicode, with the encoding > specified in the dtype, sounds perfectly reasonable to me. Changing the > behaviour of the existing unicode dtype seems like it's going to lead to > massive headaches unless exactly nobody uses it. The only downside to a > new type is having to find an obvious name that isn't already in use. > (And having to actively maintain/deprecate the old one.) > > Anne > We wouldn't really be changing the behaviour of the unicode dtype. Only programs accessing the databuffer directly and trying to decode would need to be changed. I assume this can happen for programs that do serialization + reencoding of numpy string arrays at the C level (at the python level you would be fine). 
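[Editorial note] What the "programs accessing the databuffer directly" would see today: the current 'U' dtype stores one 4-byte UCS-4/UTF-32 code unit per character in native byte order, which C-level consumers may rely on when decoding:

```python
import sys
import numpy as np

# The raw buffer of a 'U' array is UCS-4 in native byte order.
a = np.array(["ab"], dtype="U2")
raw = a.tobytes()
assert len(raw) == 8  # 2 characters * 4 bytes each

codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
assert raw.decode(codec) == "ab"

# A consumer decoding this buffer as UTF-32 would break if the same
# dtype could suddenly carry, say, latin-1 or UTF-8 payloads -- the
# compatibility risk being weighed in this thread.
```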
These programs would be broken, but only when they actually receive a string array that does not have the default utf32 encoding. I really don't like that a fully new dtype means creating more junk and extra code paths to numpy. But it is probably do big of a compatibility break to accept to keep our code clean. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL: From robert.kern at gmail.com Thu Apr 20 15:46:21 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 12:46:21 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald wrote: > > On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: >> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string. > > Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. Never mind, then. :-) >> If I had to jump ahead and propose new dtypes, I might suggest this: >> >> * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs. >> >> * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name. >> >> * Add a dtype for holding uniform-length `bytes` strings. 
This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied. > > How would this differ from a numpy array of bytes with one more dimension? The scalar in the implementation being the scalar in the use case, immutability of the scalar, directly working with b'' strings in and out (and thus work with the Python codecs easily). >> * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.). > > I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest. 
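[Editorial note] Anne's "truncating to a valid encoding" behavior can be sketched in a few lines: cut a UTF-8 byte string to at most `limit` bytes without splitting a multi-byte character. `utf8_truncate` is a hypothetical helper, not an existing NumPy function:

```python
def utf8_truncate(s: str, limit: int) -> bytes:
    """Longest valid UTF-8 prefix of s that fits in `limit` bytes."""
    cut = s.encode("utf-8")[:limit]
    # Decoding with errors="ignore" drops a trailing incomplete
    # sequence; re-encoding yields the longest valid prefix.
    return cut.decode("utf-8", errors="ignore").encode("utf-8")

assert utf8_truncate("hello", 3) == b"hel"
assert utf8_truncate("café", 5) == b"caf\xc3\xa9"   # fits exactly
assert utf8_truncate("café", 4) == b"caf"           # 2-byte é not split
```

A dtype doing this automatically would trade a few wasted tail bytes for the guarantee that the stored field is always decodable; the raise-an-exception alternative simply replaces the truncation with a `ValueError`.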
HDF5 seems to support this, but only for ASCII and UTF8, not a large list of encodings. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Apr 20 15:51:57 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 20 Apr 2017 12:51:57 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern > wrote: > >> > >> I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. > > > > > > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed > and variable length versions: > > https://github.com/PyTables/PyTables/issues/499 > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > > > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for > storage, not the number of characters. > > Ah, okay, I was interpolating from a quick perusal of the h5py docs, which > of course are also constrained by numpy's current set of dtypes. The > NULL-terminated ASCII works well enough with np.string's semantics. > Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert.kern at gmail.com Thu Apr 20 16:00:48 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 13:00:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <659e2b27-b952-4db7-e9b1-9364681f8aa8@googlemail.com> References: <659e2b27-b952-4db7-e9b1-9364681f8aa8@googlemail.com> Message-ID: On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > On 20.04.2017 20:53, Robert Kern wrote: > > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > > > > wrote: > > > >> Do you have comments on how to go forward, in particular in regards to > >> new dtype vs modify np.unicode? > > > > Can we restate the use cases explicitly? I feel like we ended up with > > the current sub-optimal situation because we never really laid out the > > use cases. We just felt like we needed bytestring and unicode dtypes, > > more out of completionism than anything, and we made a bunch of > > assumptions just to get each one done. I think there may be broad > > agreement that many of those assumptions are "wrong", but it would be > > good to reference that against concretely-stated use cases. > > We ended up in this situation because we did not take the opportunity to > break compatibility when python3 support was added. Oh, the root cause I'm thinking of long predates Python 3, or even numpy 1.0. There never was an explicitly fleshed out use case for unicode arrays other than "Python has unicode strings, so we should have a string dtype that supports it". Hence the "we only support UCS4" implementation; it's not like anyone *wants* UCS4 or interoperates with UCS4, but it does represent all possible Unicode strings. The Python 3 transition merely exacerbated the problem by making Unicode strings the primary string type to work with. I don't really want to ameliorate the exacerbation without addressing the root problem, which is worth solving. 
I will put this down as a marker use case: Support HDF5's fixed-width UTF-8 arrays. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Thu Apr 20 16:01:25 2017 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Thu, 20 Apr 2017 16:01:25 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: > I suggest a new data type 'text[encoding]', 'T'. I like the suggestion very much (it is even in between S and U!). The utf-8 manifesto linked to above convinced me that the number that should follow is the number of bytes, which is nicely consistent with use in all numerical dtypes. Any way, more specifically on Julian's question: it seems to me one has little choice but to make a new dtype (and OK if that makes unicode obsolete). I think what exact encodings to support is a separate question. -- Marten From robert.kern at gmail.com Thu Apr 20 16:04:33 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 13:04:33 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: >> >> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> > >> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. >> > >> > >> > HDF5 supports two character sets, ASCII and UTF-8. 
Both come in fixed and variable length versions: >> > https://github.com/PyTables/PyTables/issues/499 >> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html >> > >> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. >> >> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. > > Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes). "... well enough with np.string's semantics [that h5py actually used it to pass data in and out; whether that array is fit for purpose beyond that, I won't comment]." :-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From hodge at stsci.edu Thu Apr 20 16:16:40 2017 From: hodge at stsci.edu (Phil Hodge) Date: Thu, 20 Apr 2017 16:16:40 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On 04/20/2017 03:17 PM, Anne Archibald wrote: > Actually if I understood the spec, FITS header lines are 80 bytes long > and contain ASCII with no NULLs; strings are quoted and trailing > spaces are stripped. > FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written. 
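[Editorial note] The current 'S' (np.string_) dtype already matches the FITS BINTABLE convention Phil describes: short values are NULL-padded in the fixed field, a value that exactly fills the field carries no NULL, and trailing NULLs are stripped on the way back out:

```python
import numpy as np

# NULL padding on write; no NULL when the value fills the field.
a = np.array([b"abc", b"abcde"], dtype="S5")
assert a.tobytes() == b"abc\x00\x00abcde"

# Trailing NULLs are stripped from each item on read:
b = np.frombuffer(b"abc\x00\x00abcde", dtype="S5")
assert b[0] == b"abc"
assert b[1] == b"abcde"
```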
Phil From robert.kern at gmail.com Thu Apr 20 18:20:32 2017 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 20 Apr 2017 15:20:32 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: Message-ID: On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge wrote: > > On 04/20/2017 03:17 PM, Anne Archibald wrote: >> >> Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. > > FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written. Ah, that's what I was thinking of, thank you. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Apr 21 13:11:44 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 21 Apr 2017 11:11:44 -0600 Subject: [Numpy-discussion] __array_ufunc__ final review Message-ID: Hi All, The __array_ufunc__ PR is ready for final review. If there are no complaints, I plan to put it in tomorrow. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Apr 21 14:34:26 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 21 Apr 2017 11:34:26 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <8741041756854148453@unknownmsgid> References: <8741041756854148453@unknownmsgid> Message-ID: I just re-read the "Utf-8" manifesto, and it helped me clarify my thoughts: 1) most of it is focused on utf-8 vs utf-16. And that is a strong argument -- utf-16 is the worst of both worlds. 2) it isn't really addressing how to deal with fixed-size string storage as needed by numpy. 
It does bring up Python's current approach to Unicode: """ This lead to software design decisions such as Python?s string O(1) code point access. The truth, however, is that Unicode is inherently more complicated and there is no universal definition of such thing as *Unicode character*. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language for that. """ My thoughts on that-- it's technically correct, but practicality beats purity, and the character concept is pretty darn useful for at least some (commonly used in the computing world) languages. In any case, whether the top-level API is character focused doesn't really have a bearing on the internal encoding, which is very much an implementation detail in py 3 at least. And Python has made its decision about that. So what are the numpy use-cases? I see essentially two: 1) Use with/from Python -- both creating and working with numpy arrays. In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...). However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full unicode support with fixed number of bytes matching fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer life time does it take to double your memory? But now, when memory, disk space, bandwidth, etc, are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space? 
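[Editorial note] The "factor of 4" trade-off in concrete terms: the current 'U' dtype always spends 4 bytes per character, while a one-byte-per-char dtype (the existing 'S', or the latin-1 variant proposed here) spends one:

```python
import numpy as np

# Same logical content, 4x the storage for the UCS-4 version.
ucs4 = np.array(["numpy"] * 1000, dtype="U5")
onebyte = np.array([b"numpy"] * 1000, dtype="S5")

assert np.dtype("U5").itemsize == 20   # 5 chars * 4 bytes each
assert np.dtype("S5").itemsize == 5
assert ucs4.nbytes == 4 * onebyte.nbytes
```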
Alternatively, Robert's suggestion of having essentially an object array, where the objects were known to be python strings is a pretty nice idea -- it gives the full power of python strings, and is a perfect one-to-one match with the python text data model. But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that. Yes, it would have to be a "new" dtype for backwards compatibility. 2) Interchange with other systems: passing the raw binary data back and forth between numpy arrays and other code, written in C, Fortran, or binary flle formats. This is a key use-case for numpy -- I think the key to its enormous success. But how important is it for text? Certainly any data set I've ever worked with has had gobs of binary numerical data, and a small smattering of text. So in that case, if, for instance, h5py had to encode/decode text when transferring between HDF files and numpy arrays, I don't think I'd ever see the performance hit. As for code complexity -- it would mean more complex code in interface libs, and less complex code in numpy itself. (though numpy could provide utilities to make it easy to write the interface code) If we do want to support direct binary interchange with other libs, then we should probably simply go for it, and support any encoding that Python supports -- as long as you are dealing with multiple encodings, why try to decide up front which ones to support? But how do we expose this to numpy users? I still don't like having non-fixed-width encoding under the hood, but what can you do? 
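[Editorial note] "Support any encoding that Python supports" need not mean new dtypes — it can also be done with conversion utilities that pack str data into fixed-width 'S' fields using a chosen codec. `encode_fixed` and `decode_fixed` below are hypothetical helpers sketching that interface-library approach, not NumPy API:

```python
import numpy as np

def encode_fixed(strings, width, encoding):
    """Pack Python strings into a fixed-width bytes array via a codec."""
    out = np.zeros(len(strings), dtype=f"S{width}")
    for i, s in enumerate(strings):
        data = s.encode(encoding)
        if len(data) > width:
            # Raise rather than silently truncate (see the note on
            # ValueError vs truncation elsewhere in this thread).
            raise ValueError(f"{s!r} needs {len(data)} bytes, field is {width}")
        out[i] = data
    return out

def decode_fixed(arr, encoding):
    """Unpack a fixed-width bytes array back into Python strings."""
    return np.array([item.decode(encoding) for item in arr], dtype=object)

packed = encode_fixed(["abc", "dé"], width=4, encoding="utf-8")
assert packed.dtype == np.dtype("S4")
roundtrip = decode_fixed(packed, "utf-8")
assert list(roundtrip) == ["abc", "dé"]
```

The encoding lives in the caller's hands rather than in the dtype, which is the "more complex code in interface libs, less in numpy" trade-off described above.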
Other than that, having the encoding be a selectable part of the dtype works fine -- and in that case the number of bytes should be the "length" specifier. This, however, creates a bit of an impedance mismatch between the "character-focused" approach of the python string type. And requires the user to understand something about the encoding in order to even know how many bytes they need -- a utf-8-100 string will hold a different "length" of string than a utf-16-100 string. So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two. For Python use -- a pointer to a Python string would be nice. Then use a native flexible-encoding dtype for everything else. Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else. One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error: EncodingError if it can't be encoded into the defined encoding. ValueError if it is too long -- it should not be silently truncated. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shoyer at gmail.com Fri Apr 21 17:34:27 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 21 Apr 2017 14:34:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker wrote:

> 1) Use with/from Python -- both creating and working with numpy arrays.
>
> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).

Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. We already have strong precedent for dtypes reflecting the number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.

> However, there is a challenge here: numpy requires fixed-number-of-bytes dtypes. And full unicode support with fixed number of bytes matching fixed number of characters is only possible with UCS-4 -- hence the current implementation. And this is actually just fine! I know we all want to be efficient with data storage, but really -- in the early days of Unicode, when folks thought 16 bits were enough, doubling the memory usage for western language storage was considered fine -- how long in computer lifetime does it take to double your memory? But now, when memory, disk space, bandwidth, etc, are all literally orders of magnitude larger, we can't handle a factor of 4 increase in "wasted" space?

Storage cost is always going to be a concern.
Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.

> But as scientific text data often is 1-byte compatible, a one-byte-per-char dtype is a fine idea, too -- and we pretty much have that already with the existing string type -- that could simply be enhanced by enforcing the encoding to be latin-9 (or latin-1, if you don't want the Euro symbol). This would get us what scientists expect from strings in a way that is properly compatible with Python's string type. You'd get encoding errors if you tried to stuff anything else in there, and that's that.

I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

> So -- I think we should address the use-cases separately -- one for "normal" python use and simple interoperability with python strings, and one for interoperability at the binary level. And an easy way to convert between the two.
>
> For Python use -- a pointer to a Python string would be nice.

Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.

> Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be enough.

> Thinking out loud -- another option would be to set defaults for the multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility with the python string type -- and make folks make an effort to get anything else.

The np.unicode_ type is already UCS-4 and the default for dtype=str on Python 3.
We probably shouldn't change that, but if we set any default encoding for the new text type, I strongly believe it should be utf-8.

> One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
>
> EncodingError if it can't be encoded into the defined encoding.
>
> ValueError if it is too long -- it should not be silently truncated.

I think we all agree here.

-------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Mon Apr 24 13:04:53 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 10:04:53 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote:

>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).
>
> Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.

Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will be a cognitive dissonance if someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests. So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit.
If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encoded bytes naively, you could end up with an invalid bytestring, but you don't know how many characters to truncate, either.

> We already have strong precedent for dtypes reflecting number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.

sure, but a float64 is 64 bits forever and always, and the defaults perfectly match what python is doing under its hood -- even if users don't think about it. So the default behaviour of numpy matches python's built-in types.

> Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.

sure -- but again, what is the use-case for numpy arrays with a s#$)load of text in them? common? I don't think so. And as you pointed out numpy doesn't do text processing anyway, so cache performance and all that are not important. So having UCS-4 as the default, but allowing folks to select a more compact format if they really need it, is a good way to go. Just like numpy generally defaults to float64 and Int64 (or 32, depending on platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.

> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

utf-8 is NOT a one-byte per char encoding.
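[A quick sketch of both points -- the character/byte mismatch, and one safe way to truncate. The helper name here is illustrative, not an existing numpy or stdlib API:]

```python
def utf8_truncate(s: str, max_bytes: int) -> str:
    """Truncate s so its UTF-8 encoding fits in max_bytes,
    without splitting a multi-byte character at the cut point."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # Decoding the byte-truncated prefix with errors="ignore"
    # silently drops any partial character left at the end.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

s = "résumé"
assert len(s) == 6                   # six characters...
assert len(s.encode("utf-8")) == 8   # ...but eight UTF-8 bytes

# Fitting into a 5-byte field keeps whole characters only:
assert utf8_truncate(s, 5) == "résu"
```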
IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

>> For Python use -- a pointer to a Python string would be nice.
>
> Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.

hmm -- that's a nifty idea -- though I think strings could/should be special cased.

>> Then use a native flexible-encoding dtype for everything else.
>
> No opposition here from me. Though again, I think utf-8 alone would also be enough.

maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.

>> One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:
>>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>
> I think we all agree here.

I'm actually having second thoughts -- see above -- if the encoding is utf-8, then truncating is non-trivial -- maybe it would be better for numpy to do it for you. Or set a flag as to which you want?
The current 'S' dtype truncates silently already:

In [6]: arr
Out[6]: array(['this', 'that'], dtype='|S4')

In [7]: arr[0] = "a longer string"

In [8]: arr
Out[8]: array(['a lo', 'that'], dtype='|S4')

(similarly for the unicode type)

So at least we are used to that.

BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and then get 30 or 31 characters -- no big deal. But what if you have a simple label or something with one or two characters: Then you have 2 bytes to store the name in, and someone tries to put an "odd" character in there, and you get an empty string. Not good.

Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? stick with the exact number of bytes?

It all comes down to this: Python3 has made a very deliberate (and I think Good) choice to treat text as a string of characters, where the user does not need to know or care about encoding issues. Numpy's defaults should do the same thing.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov

-------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Apr 24 13:51:55 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 10:51:55 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote:

> latin-1 or latin-9 buys you (over ASCII):
>
> ...
> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes in it (see Python 2 vs Python 3 strings). Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (i.e., netCDF3) that support unspecified "one byte strings." Then you're a few short calls away from viewing (i.e., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding (i.e., np.char.decode(array.view(bytes), 'my_real_encoding')) the data in the proper encoding. It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.

On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.

>>> Then use a native flexible-encoding dtype for everything else.
>>
>> No opposition here from me. Though again, I think utf-8 alone would also be enough.
>
> maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.

Indeed, it would be helpful for this discussion to know what other encodings are actually currently used by scientific applications. So far, we have real use cases for at least UTF-8, UTF-32, ASCII and "unknown".
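[The latin-1 round-trip property and the "few short calls" decode step can both be checked directly today -- a sketch using the existing 'S' dtype and np.char.decode (which does element-wise bytes.decode); the byte values and field width are made up:]

```python
import numpy as np

# The latin-1 round-trip property: every byte value 0-255 decodes,
# and re-encoding recovers the original bytes exactly.
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data

# Bytes as loaded from a legacy format with an unspecified one-byte
# encoding; once the real encoding is known, decode element-wise:
raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S5")
decoded = np.char.decode(raw, "latin-1")
assert decoded[0] == "café"
assert decoded.dtype.kind == "U"   # now a fixed-width unicode array
```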
The current 'S' dtype truncates silently already: > One advantage of a new (non-default) dtype is that we can change this behavior. > Also -- if utf-8 is the default -- what do you get when you create an > array from a python string sequence? Currently with the 'S' and 'U' dtypes, > the dtype is set to the longest string passed in. Are we going to pad it a > bit? stick with the exact number of bytes? > It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text. We can keep bytes/str mapped to the current choices. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 13:51:55 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 13:51:55 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > > >> In this case, we want something compatible with Python's string (i.e. >>> full Unicode supporting) and I think should be as transparent as possible. >>> Python's string has made the decision to present a character oriented API >>> to users (despite what the manifesto says...). >>> >> >> Yes, but NumPy doesn't really implement string operations, so fortunately >> this is pretty irrelevant to us -- except for our API for specifying dtype >> size. >> > > Exactly -- the character-orientation of python strings means that people > are used to thinking that strings have a length that is the number of > characters in the string. I think there will a cognitive dissonance if > someone does: > > arr[i] = a_string > > Which then raises a ValueError, something like: > > String too long for a string[12] dytype array. 
> > When len(a_string) <= 12 > > AND that will only occur if there are non-ascii characters in the string, > and maybe only if there are more than N non-ascii characters. i.e. it is > very likely to be a run-time error that may not have shown up in tests. > > So folks need to do something like: > > len(a_string.encode('utf-8')) to see if their string will fit. If not, > they need to truncate it, and THAT is non-obvious how to do, too -- you > don't want to truncate the encodes bytes naively, you could end up with an > invalid bytestring. but you don't know how many characters to truncate, > either. > > >> We already have strong precedence for dtypes reflecting number of bytes >> used for storage even when Python doesn't: consider numeric types like >> int64 and float32 compared to the Python equivalents. It's an intrinsic >> aspect of NumPy that users need to think about how their data is actually >> stored. >> > > sure, but a float64 is 64 bytes forever an always and the defaults > perfectly match what python is doing under its hood --even if users don't > think about. So the default behaviour of numpy matched python's built-in > types. > > > Storage cost is always going to be a concern. Arguably, it's even more of >>> a concern today than it used to be be, because compute has been improving >>> faster than storage. >>> >> > sure -- but again, what is the use-case for numpy arrays with a s#$)load > of text in them? common? I don't think so. And as you pointed out numpy > doesn't do text processing anyway, so cache performance and all that are > not important. So having UCS-4 as the default, but allowing folks to select > a more compact format if they really need it is a good way to go. Just like > numpy generally defaults to float64 and Int64 (or 32, depending on > platform) -- users can select a smaller size if they have a reason to. 
> I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.
>
>> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.

+1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.

>>> For Python use -- a pointer to a Python string would be nice.
>>
>> Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.
>> > > hmm -- that's nifty idea -- though I think strings could/should be special > cased. > > >> Then use a native flexible-encoding dtype for everything else. >>> >> >> No opposition here from me. Though again, I think utf-8 alone would also >> be enough. >> > > maybe so -- the major reason for supporting others is binary data exchange > with other libraries -- but maybe most of them have gone to utf-8 anyway. > > One more note: if a user tries to assign a value to a numpy string array >>> that doesn't fit, they should get an error: >>> >> >>> EncodingError if it can't be encoded into the defined encoding. >>> >>> ValueError if it is too long -- it should not be silently truncated. >>> >> >> I think we all agree here. >> > > I'm actually having second thoughts -- see above -- if the encoding is > utf-8, then truncating is non-trivial -- maybe it would be better for numpy > to do it for you. Or set a flag as to which you want? > > The current 'S' dtype truncates silently already: > > In [6]: arr > > Out[6]: > array(['this', 'that'], > dtype='|S4') > > In [7]: arr[0] = "a longer string" > > In [8]: arr > > Out[8]: > array(['a lo', 'that'], > dtype='|S4') > > (similarly for the unicode type) > > So at least we are used to that. > > BTW -- maybe we should keep the pathological use-case in mind: really > short strings. I think we are all thinking in terms of longer strings, > maybe a name field, where you might assign 32 bytes or so -- then someone > has an accented character in their name, and then ge30 or 31 characters -- > no big deal. > I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings. I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world. BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. 
Hopefully the *fourth* time will be the charm! https://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html https://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html https://mail.scipy.org/pipermail/numpy-discussion/2015-February/072311.html - Tom > > > But what if you have a simple label or something with 1 or two characters: > Then you have 2 bytes to store the name in, and someone tries to put an > "odd" character in there, and you get an empty string. not good. > > Also -- if utf-8 is the default -- what do you get when you create an > array from a python string sequence? Currently with the 'S' and 'U' dtypes, > the dtype is set to the longest string passed in. Are we going to pad it a > bit? stick with the exact number of bytes? > > It all comes down to this: > > Python3 has made a very deliberate (and I think Good) choice to treat text > as a string of characters, where the user does not need to know or care > about encoding issues. Numpy's defaults should do the same thing. > > -CHB > > > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Mon Apr 24 14:13:51 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 11:13:51 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer wrote: > - round-tripping of binary data (at least with Python's encoding/decoding) >> -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the >> same bytes back. You may get garbage, but you won't get an EncodingError. >> > > For a new application, it's a good thing if a text type breaks when you to > stuff arbitrary bytes in it > maybe, maybe not -- the application may be new, but the data it works with may not be. > (see Python 2 vs Python 3 strings). > this is exactly why py3 strings needed to add the "surrogateescape" error handler: https://www.python.org/dev/peps/pep-0383 sometimes text and binary data are mixed, sometimes encoded text is broken. It is very useful to be able to pass such data through strings losslessly. Certainly, I would argue that nobody should write data in latin-1 unless > they're doing so for the sake of a legacy application. > or you really want that 1-byte per char efficiency > I do understand the value in having some "string" data type that could be > used by default by loaders for legacy file formats/applications (i.e., > netCDF3) that support unspecified "one byte strings." Then you're a few > short calls away from viewing (i.e., array.view('text[my_real_encoding]'), > if we support arbitrary encodings) or decoding (i.e., > np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the > proper encoding. It's not realistic to expect users to know the true > encoding for strings from a file before they even look at the data. 
> except that you really should :-( On the other hand, if this is the use-case, perhaps we really want an > encoding closer to "Python 2" string, i.e, "unknown", to let this be > signaled more explicitly. I would suggest that "text[unknown]" should > support operations like a string if it can be decoded as ASCII, and > otherwise error. But unlike "text[ascii]", it will let you store arbitrary > bytes. > I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through. So far, we have real use cases for at least UTF-8, UTF-32, ASCII and > "unknown". > hmm -- "unknown" should be bytes, not text. If the user needs to look at it first, then load it as bytes, run chardet or something on it, then cast to the right encoding. The current 'S' dtype truncates silently already: >> > > One advantage of a new (non-default) dtype is that we can change this > behavior. > yeah -- still on the edge about that, at least with variable-size encodings. It's hard to know when it's going to happen and it's hard to know what to do when it does. At least if if truncates silently, numpy can have the code to do the truncation properly. Maybe an option? And the numpy numeric types truncate (Or overflow) already. Again: If the default string handling matches expectations from python strings, then the specialized ones can be more buyer-beware. Also -- if utf-8 is the default -- what do you get when you create an array >> from a python string sequence? Currently with the 'S' and 'U' dtypes, the >> dtype is set to the longest string passed in. Are we going to pad it a bit? >> stick with the exact number of bytes? >> > > It might be better to avoid this for now, and force users to be explicit > about encoding if they use the dtype for encoded text. > yup. 
And we really should have a bytes type for py3 -- which we do, it's just called 'S', which is pretty confusing :-) -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Mon Apr 24 14:21:48 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 11:21:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > BTW -- maybe we should keep the pathological use-case in mind: really >> short strings. I think we are all thinking in terms of longer strings, >> maybe a name field, where you might assign 32 bytes or so -- then someone >> has an accented character in their name, and then ge30 or 31 characters -- >> no big deal. >> > > I wouldn't call it a pathological use case, it doesn't seem so uncommon to > have large datasets of short strings. > It's pathological for using a variable-length encoding. > I personally deal with a database of hundreds of billions of 2 to 5 > character ASCII strings. This has been a significant blocker to Python 3 > adoption in my world. > I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings... BTW, for those new to the list or with a short memory, this topic has been > discussed fairly extensively at least 3 times before. Hopefully the > *fourth* time will be the charm! > yes, let's hope so! The big difference now is that Julian seems to be committed to actually making it happen! Thanks Julian! 
Which brings up a good point -- if you need us to stop the damn bike-shedding so you can get it done -- say so. I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing.

-Chris

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov

-------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 14:36:15 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 11:36:15 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker wrote:
>
> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote:
>>>
>>> BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and then get 30 or 31 characters -- no big deal.
>>
>> I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings.
>
> It's pathological for using a variable-length encoding.
>
>> I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world.
>
> I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings...

Unless you have hundreds of billions of them.
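[The arithmetic behind that objection, as a back-of-envelope sketch with illustrative numbers -- 10^11 strings padded to 5 characters:]

```python
n = 100 * 10**9      # "hundreds of billions" of strings
width = 5            # 2-5 character codes, padded to 5

one_byte_per_char = n * width    # latin-1/ASCII: 1 byte per char
ucs4 = n * width * 4             # a 'U5' dtype: 4 bytes per char

assert one_byte_per_char == 500 * 10**9   # ~500 GB
assert ucs4 == 2 * 10**12                 # ~2 TB
```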
>> BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. Hopefully the *fourth* time will be the charm! > > yes, let's hope so! > > The big difference now is that Julian seems to be committed to actually making it happen! > > Thanks Julian! > > Which brings up a good point -- if you need us to stop the damn bike-shedding so you can get it done -- say so. > > I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing. FWIW, I prefer nothing to just adding a special case for latin-1. Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 14:47:21 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 11:47:21 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. > > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. 
That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 14:56:55 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 14:56:55 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < > aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker > wrote: > > >> - round-tripping of binary data (at least with Python's > encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and > re-encoded to get the same bytes back. You may get garbage, but you won't > get an EncodingError. > > > > +1. The key point is that there is a HUGE amount of legacy science data > in the form of FITS (astronomy-specific binary file format that has been > the primary file format for 20+ years) and HDF5 which uses a character data > type to store data which can be bytes 0-255. Getting an decoding/encoding > error when trying to deal with these datasets is a non-starter from my > perspective. > > That says to me that these are properly represented by `bytes` objects, > not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 > encoding. > If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters. 
So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution. - Tom > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 15:09:06 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 12:09:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > >>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...). >> >> >> Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. > > Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will a cognitive dissonance if someone does: > > arr[i] = a_string > > Which then raises a ValueError, something like: > > String too long for a string[12] dytype array. We have the freedom to make the error message not suck. :-) > When len(a_string) <= 12 > > AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests. 
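
For reference, today's fixed-width dtypes don't raise at all on overlong assignment -- they silently truncate, which is the behavior the hypothetical ValueError above would replace; a minimal sketch with current numpy:

```python
import numpy as np

arr = np.array(['this', 'that'])   # dtype becomes a 4-character 'U4'
arr[0] = 'thistle'                 # silently truncated, no error raised
print(arr[0])                      # 'this'
```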
> > So folks need to do something like: > > len(a_string.encode('utf-8')) to see if their string will fit. If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encodes bytes naively, you could end up with an invalid bytestring. but you don't know how many characters to truncate, either. If this becomes the right strategy for dealing with these problems (and I'm not sure that it is), we can easily make a utility function that does this for people. This discussion is why I want to be sure that we have our use cases actually mapped out. For this kind of in-memory manipulation, I'd use an object array (a la pandas), then convert to the uniform-width string dtype when I needed to push this out to a C API, HDF5 file, or whatever actually requires a string-dtype array. The required width gets computed from the data after all of the manipulations are done. Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1! >> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. > > utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context. > > latin-1 or latin-9 buys you (over ASCII): > > - A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better. > > - A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...) 
> > - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 16:06:06 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 13:06:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> >> >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. >> > >> > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. >> >> That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. 
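
The truncation problem Chris raises above -- cutting encoded UTF-8 bytes naively can leave an invalid partial character -- can be sketched with a small helper (`utf8_truncate` is a hypothetical illustration, not a numpy or stdlib API):

```python
def utf8_truncate(s: str, max_bytes: int) -> str:
    """Truncate s so its UTF-8 encoding fits in max_bytes
    without splitting a multibyte character."""
    b = s.encode('utf-8')[:max_bytes]
    # Cutting mid-character leaves an invalid trailing sequence;
    # errors='ignore' discards that partial tail.
    return b.decode('utf-8', errors='ignore')

print(utf8_truncate('héllo', 2))  # 'h'  -- the 2-byte 'é' doesn't fit
print(utf8_truncate('héllo', 3))  # 'hé'
```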
> > If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters. I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? > So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution. Well, I outlined a solution: work with `bytes` arrays with utilities to convert to/from the Unicode-aware string dtypes (or `object`). A UTF-8-specific dtype and maybe a string-specialized `object` dtype address the very real and consequential problems that I face (namely and respectively, working with HDF5 and in-memory manipulation of string datasets). I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have- been-warned-you're-gonna-get-mojibake option. It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
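
Both halves of this disagreement -- the lossless byte round-trip Chris relies on, and the mojibake Robert warns about -- show up in a few lines of plain Python:

```python
# UTF-8 bytes for 'é' misread as latin-1: no error, lossless
# round-trip -- but what you *see* is mojibake.
raw = 'é'.encode('utf-8')               # b'\xc3\xa9'
text = raw.decode('latin-1')            # 'Ã©' (garbage, but no exception)
assert text.encode('latin-1') == raw    # bytes survive the round trip
print(text)
```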
URL: From chris.barker at noaa.gov Mon Apr 24 17:00:13 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 24 Apr 2017 14:00:13 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: > > I agree -- it is a VERY common case for scientific data sets. But a > one-byte-per-char encoding would handle it nicely, or UCS-4 if you want > Unicode. The wasted space is not that big a deal with short strings... > > Unless if you have hundreds of billions of them. > Which is why a one-byte-per char encoding is a good idea. Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) > I agree-- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to so it? Do numpy arrays HAVE to be storing utf-8 natively? > or leave it be until someone else is willing to solve that problem. I > don't think we're at the bikeshedding stage yet; we're still disagreeing > about fundamental requirements. > yeah -- though I've seen projects get stuck in the sorting out what to do, so nothing gets done stage before -- I don't want Julian to get too frustrated and end up doing nothing. So here I'll lay out what I think are the fundamental requirements: 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do: arr = np.array(("this", "that",)) you get an array that can store ANY unicode string with 4 or less characters and arr[1] will return a native Python string object. 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not be wasting space for "typical european-oriented data". arr = np.array(("this", "that",), dtype=np.single_byte_string) (name TBD) and arr[1] would return a python string. 
attempting to put in a not-compatible with the encoding string in would raise an Encoding Error. I highly recommend that (SO 8859-15 ( latin-9 or latin-1) be the encoding in this case. 3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returns a bytes object. - you could use astype() to convert between bytes and a specified encoding with no change in binary representation. 2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes. 1) could be covered with the existing 'U': type - only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features. > +1. The key point is that there is a HUGE amount of legacy science data > in the form of FITS (astronomy-specific binary file format that has been > the primary file format for 20+ years) and HDF5 which uses a character data > type to store data which can be bytes 0-255. Getting an decoding/encoding > error when trying to deal with these datasets is a non-starter from my > perspective. That says to me that these are properly represented by `bytes` objects, not > `unicode/str` objects encoding to and decoding from a hardcoded latin-1 > encoding. Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAVE to be bytes" bit Python3 is the butt for a long time. 
Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogate_escape, and other features that facilitate working with messy almost text. Practicality beats purity -- if you have one-byte per char data that is mostly european, than latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error. > - round-tripping of binary data (at least with Python's > encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and > re-encoded to get the same bytes back. You may get garbage, but you won't > get an EncodingError. > But what if the format I'm working with specifies another encoding? Am I > supposed to encode all of my Unicode strings in the specified encoding, > then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a > really important use case for me. latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too. As for external data in utf-8 -- yes that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array. utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings. Doing in-memory assignments to a fixed-encoding, fixed-width string dtype > will always have this kind of problem. You should only put up with it if > you have a requirement to write to a format that specifies the width and > the encoding. That specified encoding is frequently not latin-1! 
> of course not -- if you are writing to a format that specifies a width and the encoding, you want o use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for. > - round-tripping of binary data (at least with Python's > encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and > re-encoded to get the same bytes back. You may get garbage, but you won't > get an EncodingError. > > But what if the format I'm working with specifies another encoding? Am I > supposed to encode all of my Unicode strings in the specified encoding, > then decode as latin-1 to assign into my array? of course not -- see above. I'm happy to consider a latin-1-specific dtype as a second, > workaround-for-specific-applications-only-you-have-been- > warned-you're-gonna-get-mojibake option. well, it wouldn't create mojibake - anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless is came from already corrupted data. but when you have corrupted data, your only choices are to: - raise an error - alter the data (error-"replace") - pass the corrupted data on through. but it could deal with mojibake -- that's the whole point :-) > It should not be *the* Unicode string dtype (i.e. named np.realstring or > np.unicode as in the original proposal). God no -- sorry if it looked like I was suggesting that. I only suggest that it might be *the* one-byte-per-char string type -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
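
Chris's requirement 2 can be sketched with existing pieces; `to_onebyte` below is a hypothetical illustration, not a proposed API -- it just shows that `str.encode` already provides the raise-on-unencodable behavior he asks for:

```python
import numpy as np

def to_onebyte(strings, width, encoding='latin-1'):
    """Hypothetical sketch: store Python strings one byte per char
    in an 'S' array, raising if a string can't be encoded."""
    # str.encode raises UnicodeEncodeError for characters outside
    # the target encoding -- the "Encoding Error" in requirement 2.
    return np.array([s.encode(encoding) for s in strings],
                    dtype='S%d' % width)

arr = to_onebyte(['naïve', 'café'], width=5)
print(arr[0])          # b'na\xefve' -- one byte per char, accents intact

try:
    to_onebyte(['日本'], width=2)
except UnicodeEncodeError:
    print('rejected: not representable in latin-1')
```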
URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 19:06:56 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 19:06:56 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: > I am not unfamiliar with this problem. I still work with files that have > fields that are supposed to be in EBCDIC but actually contain text in > ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit > encodings. In that experience, I have found that just treating the data as > latin-1 unconditionally is not a pragmatic solution. It's really easy to > implement, and you do get a program that runs without raising an exception > (at the I/O boundary at least), but you don't often get a program that > really runs correctly or treats the data properly. > > Can you walk us through the problems that you are having with working with > these columns as arrays of `bytes`? > This is very simple and obvious but I will state for the record. Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat. In [22]: dat = np.array(['yes', 'no'], dtype='S3') In [23]: dat == 'yes' # FAIL (but works just fine in Py2) Out[23]: False In [24]: dat == b'yes' # Right answer but not practical Out[24]: array([ True, False], dtype=bool) - Tom [1]: Using h5py or pytables. Same with FITS, although astropy.io.fits does some tricks under the hood to auto-convert to `U` type as needed. 
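
One workaround that exists today -- at exactly the 4x memory cost Tom mentions -- is `np.char.decode`, which converts the bytes array to `U` so comparisons against string literals work again:

```python
import numpy as np

dat = np.array(['yes', 'no'], dtype='S3')

# Decoding to a 'U' array restores natural Py3 comparisons,
# but allocates 4 bytes per character.
print(np.char.decode(dat, 'ascii') == 'yes')   # [ True False]
```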
> > > > So I would beg to actually move forward with a pragmatic solution that > addresses very real and consequential problems that we face instead of > waiting/praying for a perfect solution. > > Well, I outlined a solution: work with `bytes` arrays with utilities to > convert to/from the Unicode-aware string dtypes (or `object`). > > A UTF-8-specific dtype and maybe a string-specialized `object` dtype > address the very real and consequential problems that I face (namely and > respectively, working with HDF5 and in-memory manipulation of string > datasets). > > I'm happy to consider a latin-1-specific dtype as a second, > workaround-for-specific-applications-only-you-have-been- > warned-you're-gonna-get-mojibake option. It should not be *the* Unicode > string dtype (i.e. named np.realstring or np.unicode as in the original > proposal). > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 19:08:50 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 16:08:50 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating. On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: >> Solve the HDF5 problem (i.e. 
fixed-length UTF-8 strings) > > I agree-- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to so it? Do numpy arrays HAVE to be storing utf-8 natively? If the point is to have an array that transparently accepts/yields `unicode/str` scalars while maintaining the in-memory encoding, yes. If that's not the point, then IMO the status quo is fine, and *no* new dtypes should be added, just maybe some utility functions to convert between the bytes-ish arrays and the Unicode-holding arrays (which was one of my proposals). I am mostly happy to live in a world where I read in data as bytes-ish arrays, decode into `object` arrays holding `unicode/str` objects, do my manipulations, then encode the array into a bytes-ish array to give to the C API or file format. >> or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements. > > yeah -- though I've seen projects get stuck in the sorting out what to do, so nothing gets done stage before -- I don't want Julian to get too frustrated and end up doing nothing. I understand, but not all tedious discussions that have not yet achieved consensus are bikeshedding to be cut short. We couldn't really decide what to do back in the pre-1.0 days, too, so we just did *something*, and that something is now the very situation that Julian has a problem with. We have more experience now, especially with the added wrinkles of Python 3; other projects have advanced and matured their Unicode string array-handling (e.g. pandas and HDF5); now is a great time to have a real discussion about what we *need* before we make decisions about what we should *do*. > So here I'll lay out what I think are the fundamental requirements: > > 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. 
fully unicode supporting, and with a character oriented interface. i.e. if you do: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode string with 4 or less characters > > and arr[1] will return a native Python string object. > > 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not be wasting space for "typical european-oriented data". > > arr = np.array(("this", "that",), dtype=np.single_byte_string) > > (name TBD) > > and arr[1] would return a python string. > > attempting to put in a not-compatible with the encoding string in would raise an Encoding Error. > > I highly recommend that (SO 8859-15 ( latin-9 or latin-1) be the encoding in this case. > > 3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) > > 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returns a bytes object. > - you could use astype() to convert between bytes and a specified encoding with no change in binary representation. You'll need to specify what NULL-terminating behavior you want here. np.string_ has NULL-termination. np.void (which could be made to work better with `bytes`) does not. Both have use-cases for text encoding (shakes fist at UTF-16). > 2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes. > > 1) could be covered with the existing 'U': type - only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features. > >> > +1. 
The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. > >> That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. > > Well, yes -- BUT: That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAVE to be bytes" bit Python3 is the butt for a long time. Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogate_escape, and other features that facilitate working with messy almost text. Walk me through a problem that you've encountered with such textish data in arrays. I know the problems in Web protocol-land, but they are not really relevant to us. What are *your* problems? Why didn't those ameliorations that were added for the Web world address your problems? I really want to get at specific use cases that interact with numpy, not handwaving at problems other people have had in other contexts. > Practicality beats purity -- if you have one-byte per char data that is mostly european, than latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error. > >> > - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. 
>> But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me. > > latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too. In my experience, it has both upside and downside. Silently creating mojibake is a problem. The process that you described, decoding ANY strings of bytes as latin-1, can create mojibake. The inverse, encoding then decoding, may not, but of course the encoding step there does not accept arbitrary Unicode strings. > As for external data in utf-8 -- yes that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array. > > utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings. > >> Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1! > > of course not -- if you are writing to a format that specifies a width and the encoding, you want o use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for. Ah, your message was responding to Stephan who questioned why latin-1 should be the default encoding for the `unicode/str`-aware string dtype. It seemed like you were affirming that latin-1 ought to be that default. 
It seems like that is not your position, but you are defending the existence of a latin-1 dtype for specific uses. >> I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option. > > well, it wouldn't create mojibake - anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless is came from already corrupted data. but when you have corrupted data, your only choices are to: > > - raise an error > - alter the data (error-"replace") > - pass the corrupted data on through. > > but it could deal with mojibake -- that's the whole point :-) You are right that assigning a `unicode/str` object into my latin-1-dtype array would not create mojibake, but that's not the only way to fill a numpy array. In the context of my email, I was responding to a use case being floated for the latin-1 dtype that was to read existing FITS files that have fields that are text-ish: plain octets according to the file format standard, but in practice mostly ASCII with a few sparse high-bit characters typically from some unspecified iso-8859-* encoding. If that unspecified encoding wasn't latin-1, then I'm getting mojibake when I read the file (unless if, happy days, the author of the file was also using latin-1). I understand that you are proposing a latin-1 dtype in a context with other dtypes and tools that might make that use of the latin-1 dtype obsolete. However, there are others who have been proposing just a latin-1 dtype for this purpose. Let me make a counter-proposal for your latin-1 dtype (your #2) that might address your, Thomas's, and Julian's use cases: 2) We want a single-byte-per-character, NULL-terminated string dtype that can be used to represent mostly-ASCII textish data that may have some high-bit characters from some 8-bit encoding. 
It should be able to read arbitrary bytes (that is, up to the NULL-termination) and write them back out as the same bytes if unmodified. This lets us read this text from files where the encoding is unspecified (or is lying about the encoding) into `unicode/str` objects. The encoding is specified as `ascii` but the decoding/encoding is done with the `surrogateescape` option so that high-bit characters are faithfully represented in the `unicode/str` string but are not erroneously reinterpreted as other characters from an arbitrary encoding. I'd even be happy if Julian or someone wants to go ahead and implement this right now and leave the UTF-8 dtype for a later time. As long as this ASCII-surrogateescape dtype is not called np.realstring (it's *really* important to me that the bikeshed not be this color). ;-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Apr 24 19:09:48 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 16:09:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: > On the other hand, if this is the use-case, perhaps we really want an >> encoding closer to "Python 2" string, i.e, "unknown", to let this be >> signaled more explicitly. I would suggest that "text[unknown]" should >> support operations like a string if it can be decoded as ASCII, and >> otherwise error. But unlike "text[ascii]", it will let you store arbitrary >> bytes. >> > > I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really > is ascii, then it's perfect. If it really is latin-*, then you get some > extra useful stuff, and if it's corrupted somehow, you still get the ascii > text correct, and the rest won't barf and can be passed on through. 
> I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters." My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data? I don't think that silently (mis)interpreting non-ASCII characters as latin-1/9 is a good idea, which is why I think it would be a mistake to use 'latin-1' for text data with unknown encoding. I could get behind a data type that compares equal to strings for ASCII only and allows for *storing* other characters, but making blind assumptions about characters 128-255 seems like a recipe for disaster. Imagine text[unknown] as a one character string type, but it supports .decode() like bytes and every character in the range 128-255 compares for equality with other characters like NaN -- not even equal to itself. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 19:11:25 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 16:11:25 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. 
It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. >> >> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? > > This is very simple and obvious but I will state for the record. I appreciate it. What is obvious to you is not obvious to me. > Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat. > > In [22]: dat = np.array(['yes', 'no'], dtype='S3') > > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) > Out[23]: False > > In [24]: dat == b'yes' # Right answer but not practical > Out[24]: array([ True, False], dtype=bool) I'm curious why you think this is not practical. It seems like a very practical solution to me. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Apr 24 19:19:16 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 16:19:16 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Let me make a counter-proposal for your latin-1 dtype (your #2) that might > address your, Thomas's, and Julian's use cases: > > 2) We want a single-byte-per-character, NULL-terminated string dtype that > can be used to represent mostly-ASCII textish data that may have some > high-bit characters from some 8-bit encoding. 
It should be able to read > arbitrary bytes (that is, up to the NULL-termination) and write them back > out as the same bytes if unmodified. This lets us read this text from files > where the encoding is unspecified (or is lying about the encoding) into > `unicode/str` objects. The encoding is specified as `ascii` but the > decoding/encoding is done with the `surrogateescape` option so that > high-bit characters are faithfully represented in the `unicode/str` string > but are not erroneously reinterpreted as other characters from an arbitrary > encoding. > > I'd even be happy if Julian or someone wants to go ahead and implement > this right now and leave the UTF-8 dtype for a later time. > > As long as this ASCII-surrogateescape dtype is not called np.realstring > (it's *really* important to me that the bikeshed not be this color). ;-) > This sounds quite similar to my text[unknown] proposal, with the advantage that the concept of "surrogateescape" that already exists. Surrogate-escape characters compare equal to themselves, which is maybe less than ideal, but it looks like you can put them in real unicode strings, which is nice. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 19:23:37 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 16:23:37 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: >>> >>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes. 
>> >> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through. > > I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters." > > My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data? I don't know that we can reasonably make that accounting relevant. Number of such characters per byte of text? Number of files with such characters out of all existing files? What I can say with assurance is that every time I have decided, as a developer, to write code that just hardcodes latin-1 for such cases, I have regretted it. While it's just personal anecdote, I think it's at least measuring the right thing. :-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From aldcroft at head.cfa.harvard.edu Mon Apr 24 20:56:43 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Mon, 24 Apr 2017 20:56:43 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < > aldcroft at head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern > wrote: > >> > >> I am not unfamiliar with this problem. I still work with files that > have fields that are supposed to be in EBCDIC but actually contain text in > ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit > encodings. 
In that experience, I have found that just treating the data as > latin-1 unconditionally is not a pragmatic solution. It's really easy to > implement, and you do get a program that runs without raising an exception > (at the I/O boundary at least), but you don't often get a program that > really runs correctly or treats the data properly. > >> > >> Can you walk us through the problems that you are having with working > with these columns as arrays of `bytes`? > > > > This is very simple and obvious but I will state for the record. > > I appreciate it. What is obvious to you is not obvious to me. > > > Reading an HDF5 file with character data currently gives arrays of > `bytes` [1]. In Py3 this cannot be compared to a string literal, and > comparing to (or assigning from) explicit byte strings everywhere in the > code quickly spins out of control. This generally forces one to convert > the data to `U` type and incur the 4x memory bloat. > > > > In [22]: dat = np.array(['yes', 'no'], dtype='S3') > > > > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) > > Out[23]: False > > > > In [24]: dat == b'yes' # Right answer but not practical > > Out[24]: array([ True, False], dtype=bool) > > I'm curious why you think this is not practical. It seems like a very > practical solution to me. > In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc) you need to explicitly coerce the right hand side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable. The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate. 
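The trade-off being described is easy to demonstrate with stock NumPy -- a minimal sketch showing both the bytes-comparison behaviour and the 4x storage cost of the `U` workaround:

```python
import numpy as np

# The Py3 pain point: a bytes ('S') array only compares cleanly
# against bytes, so every interaction needs an explicit b'...'.
dat = np.array(['yes', 'no'], dtype='S3')
print(dat == b'yes')       # [ True False]

# The common workaround is a unicode sandwich: convert to the
# 4-bytes-per-character 'U' dtype at the interface...
udat = dat.astype('U3')
print(udat == 'yes')       # [ True False]

# ...at 4x the memory cost:
print(dat.dtype.itemsize, udat.dtype.itemsize)   # 3 12
```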
- Tom > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 21:36:46 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 18:36:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcroft at head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. >> >> >> >> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? >> > >> > This is very simple and obvious but I will state for the record. >> >> I appreciate it. What is obvious to you is not obvious to me. >> >> > Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. 
This generally forces one to convert the data to `U` type and incur the 4x memory bloat. >> > >> > In [22]: dat = np.array(['yes', 'no'], dtype='S3') >> > >> > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) >> > Out[23]: False >> > >> > In [24]: dat == b'yes' # Right answer but not practical >> > Out[24]: array([ True, False], dtype=bool) >> >> I'm curious why you think this is not practical. It seems like a very practical solution to me. > > In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc) you need to explicitly coerce the right hand side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable. Okay, so the problem isn't with (byte-)string literals, but with variables being passed around from other sources. Eg. def func(dat, scalar): return dat == scalar Every one of those functions deepens the abstraction and moves that unicode-by-default scalar farther away from the bytesish array, so it's harder to demand that users of those functions be aware that they need to pass in `bytes` strings. So you need to implement those functions defensively, which complicates them. > The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate. What do you think about my ASCII-surrogateescape proposal? Do you think that would work with your use cases? In general, I don't think Unicode sandwiches will be eliminated by this or the latin-1 dtype; the sandwich is usually the right thing to do and the surrogateescape the wrong thing. But I'm keenly aware of the problems you get when there just isn't a reliable encoding to use. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Mon Apr 24 22:07:23 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 24 Apr 2017 19:07:23 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 21, 2017 2:34 PM, "Stephan Hoyer" wrote: I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. You may already know this, but probably not everyone reading does: the reason why latin1 often gets special attention in discussions of Unicode encoding is that latin1 is effectively "ucs1". It's the unique one byte text encoding where byte N represents codepoint U+N. I can't think of any reason why this property is particularly important for numpy's usage, because we always have a conversion step anyway to get data in and out of an array. The potential arguments for latin1 that I can think of are: - if we have to implement our own en/decoding code for some reason then it's the most trivial encoding - if other formats standardize on latin1-with-nul-padding and we want in-memory/mmap compatibility - if we really want a fixed width encoding for some reason but don't care which one, then it's in some sense the most obvious choice I can't think of many reasons why having a fixed width encoding is particularly important though... For our current style of string storage, even calculating the length of a string is O(n), and AFAICT the only way to actually take advantage of the theoretical O(1) character indexing is to make a uint8 view. I guess it would be useful if we had a string slicing ufunc... But why would we? 
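The "ucs1" property is easy to verify in plain Python: latin-1 is the one 8-bit encoding where byte N decodes to codepoint U+N, which is also why it can round-trip arbitrary bytes without raising:

```python
# latin-1 as "ucs1": byte N decodes to codepoint U+N, for all 256 bytes
data = bytes(range(256))
text = data.decode('latin-1')
assert all(ord(ch) == b for ch, b in zip(text, data))

# ...so any byte string survives a decode/encode round-trip unchanged
assert text.encode('latin-1') == data
```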
That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically. From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 22:23:55 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 19:23:55 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically. > > From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't. Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones.
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Apr 24 22:41:31 2017 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 24 Apr 2017 19:41:31 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > >> That said, AFAICT what people actually want in most use cases is support >> for arrays that can hold variable-length strings, and the only place where >> the current approach is *optimal* is when we need mmap compatibility with >> legacy formats that use fixed-width-nul-padded fields (at which point it's >> super convenient). It's not even possible to *represent* all Python strings >> or bytestrings in current numpy unicode or string arrays (Python >> strings/bytestrings can have trailing nuls). So if we're talking about >> tweaks to the current system it probably makes sense to focus on this use >> case specifically. >> >> From context I'm assuming FITS files use fixed-width-nul-padding for >> strings? Is that right? I know HDF5 doesn't. > > Yes, HDF5 does. Or at least, it is supported in addition to the > variable-length ones. > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html Doh, I found that page but it was (and is) meaningless to me, so I went by http://docs.h5py.org/en/latest/strings.html, which says the options are fixed-width ascii, variable-length ascii, or variable-length utf-8 ... I guess it's just talking about what h5py currently supports. But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-). 
Is it important for some other reason? Also, further searching suggests that HDF5 actually supports all of nul termination, nul padding, and space padding, and that nul termination is the default? How much does it help to have in-memory compatibility with just one of these options (and not even the default one)? Would we need to add the other options to be really useful for HDF5? (Unlikely to happen within numpy itself, but potentially something that could be done inside h5py or whatever if numpy's user-defined dtype system were a little more useful.) -n [1] hope -- Nathaniel J. Smith -- https://vorpus.org From shoyer at gmail.com Mon Apr 24 23:01:48 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 24 Apr 2017 20:01:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Of course they do :) https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60d5f8/pyfive/low_level.py#L682 > Also, further searching suggests that HDF5 actually supports all of > nul termination, nul padding, and space padding, and that nul > termination is the default? How much does it help to have in-memory > compatibility with just one of these options (and not even the default > one)? Would we need to add the other options to be really useful for > HDF5? h5py actually ignores this option and only uses null termination. I have not heard any complaints about this (though I have heard complaints about the lack of fixed-length UTF-8). But more generally, you're right. h5py doesn't need a corresponding NumPy dtype for each HDF5 string dtype, though that would certainly be *convenient*. 
In fact, it already (ab)uses NumPy's dtype metadata with h5py.special_dtype to indicate a homogeneous string type for object arrays. I would guess h5py users have the same needs for efficient string representations (including surrogate-escape options) as other scientific users. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Apr 24 23:07:33 2017 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 24 Apr 2017 20:07:33 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > > On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > > > >> That said, AFAICT what people actually want in most use cases is support > >> for arrays that can hold variable-length strings, and the only place where > >> the current approach is *optimal* is when we need mmap compatibility with > >> legacy formats that use fixed-width-nul-padded fields (at which point it's > >> super convenient). It's not even possible to *represent* all Python strings > >> or bytestrings in current numpy unicode or string arrays (Python > >> strings/bytestrings can have trailing nuls). So if we're talking about > >> tweaks to the current system it probably makes sense to focus on this use > >> case specifically. > >> > >> From context I'm assuming FITS files use fixed-width-nul-padding for > >> strings? Is that right? I know HDF5 doesn't. > > > > Yes, HDF5 does. Or at least, it is supported in addition to the > > variable-length ones. 
> > > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > Doh, I found that page but it was (and is) meaningless to me, so I > went by http://docs.h5py.org/en/latest/strings.html, which says the > options are fixed-width ascii, variable-length ascii, or > variable-length utf-8 ... I guess it's just talking about what h5py > currently supports. It's okay, I made exactly the same mistake earlier in the thread. :-) > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Is it important for some other reason? The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk-space because they can't be compressed (even accounting for the wasted padding space). I mean, they probably could have implemented it with objects arrays like h5py does with the variable-length string Datasets, but they didn't. https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/624 -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 12:01:05 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:01:05 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Chris, you've mashed all of my emails together, some of them are in reply > to you, some in reply to others. Unfortunately, this dropped a lot of the > context from each of them, and appears to be creating some > misunderstandings about what each person is advocating. 
> Sorry about that -- I was trying to keep an already really long thread from getting even longer.... And I'm not sure it matters who's doing the advocating, but rather *what* is being advocated -- I hope I didn't screw that up too badly. Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. So I'll try again -- use-case only! we'll keep the possible solutions separate. Do we need to write up a NEP for this? it seems we are going a bit in circles, and we really do want to capture the final decision process. 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:: arr = np.array(("this", "that",)) you get an array that can store ANY unicode string with 4 or less characters. and arr[1] will return a native Python3 string object. This is the use-case for "casual" numpy users -- not the folks writing H5py and the like, or the ones writing Cython bindings to C++ libs. 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not to be wasting space for "typical european-language-oriented data". Note: this should ALSO be compatible with Python's character-oriented string model. i.e. a Python String with length N will fit into a dtype of size N. arr = np.array(("this", "that",), dtype=np.single_byte_string) (name TBD) and arr[1] would return a python string. attempting to put in a String not compatible with the encoding would raise an EncodingError. This is also a use-case primarily for "casual" users -- but ones concerned with the size of the data storage and know they are using european text. 3) dtypes that support storage in particular encodings: Python strings would be encoded appropriately when put into the array.
A Python string would be returned when indexing. a) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) at the binary level. b) There be a dtype that could store data in any encoding supported by Python -- to facilitate bytes-level interchange with other systems. If we need more than utf-8, then we might as well have the full set. 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object (or other memoryview?), and returns a bytes object. You could use astype() to convert between bytes and a specified encoding with no change in binary representation. This could be used to store any binary data, including encoded text or anything else. this should map directly to the Python bytes model -- thus NOT null-terminted. This is a little different than 'S' behaviour on py3 -- it appears that with 'S', a if ALL the trailing bytes are null, then it is truncated, but if there is a null byte in the middle, then it is preserved. I suspect that this is a legacy from Py2's use of "strings" as both text and binary data. But in py3, a "bytes" type should be about bytes, and not text, and thus null-values bytes are simply another value a byte can hold. There are multiple ways to address these use cases -- please try to make your comments clear about whether you think the use-case is unimportant, or ill-defined, or if you think a given solution is a poor choice. To facilitate that, I will put my comments on possible solutions in a separate note, too. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Tue Apr 25 12:34:46 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:34:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: This is essentially my rant about use-case (2): A compact dtype for mostly-ascii text: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker > wrote: > >> On the other hand, if this is the use-case, perhaps we really want an >>> encoding closer to "Python 2" string, i.e, "unknown", to let this be >>> signaled more explicitly. I would suggest that "text[unknown]" should >>> support operations like a string if it can be decoded as ASCII, and >>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary >>> bytes. >>> >> >> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it >> really is ascii, then it's perfect. If it really is latin-*, then you get >> some extra useful stuff, and if it's corrupted somehow, you still get the >> ascii text correct, and the rest won't barf and can be passed on through. >> > > I am totally in agreement with Thomas that "We are living in a messy > world right now with messy legacy datasets that have character type data > that are *mostly* ASCII, but not infrequently contain non-ASCII characters." > > My question: What are those non-ASCII characters? How often are they truly > latin-1/9 vs. some other text encoding vs. non-string binary data? > I am totally euro-centric, but as I understand it, that is the whole point of the desire for a compact one-byte-per character encoding. If there is a strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we should support that. But this all started with "mostly ascii". 
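Concretely, here is how the candidate decodings treat a mostly-ASCII byte string with a single stray high-bit byte (0xE9, which is 'é' only if the source really was latin-1):

```python
raw = b'caf\xe9'  # mostly ASCII, one stray high-bit byte

# errors='replace': the byte is destroyed (replaced by U+FFFD)
assert raw.decode('ascii', 'replace') == 'caf\ufffd'

# errors='surrogateescape': the byte is smuggled into a lone surrogate;
# it round-trips exactly, but is not printable as text
s = raw.decode('ascii', 'surrogateescape')
assert s == 'caf\udce9'
assert s.encode('ascii', 'surrogateescape') == raw

# latin-1: readable text that also round-trips -- but it is only
# *correct* if the data really was latin-1
assert raw.decode('latin-1') == 'café'
```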
My take on that is: We don't want to use pure-ASCII -- that is the hell that python2's default encoding approach led to -- it is MUCH better to pass garbage through than crash out with an EncodingError -- data are messy, and people are really bad at writing comprehensive tests. So we need something that handles ASCII properly, and can pass through arbitrary bytes as well without crashing. Options are: * ASCII with errors='ignore' or 'replace' I think that is a very bad idea -- it is tossing away information that _may_ have some use elsewhere:: s = arr[i] arr[i] = s should put the same bytes back into the array. * ASCII with errors='surrogateescape' This would preserve bytes and not crash out, so meets the key criteria. * latin-1 This would do exactly the correct thing for ASCII, preserve the bytes, and not crash out. But it would also allow additional symbols useful to european languages and scientific computing. Seems like a win-win to me. As for my use-cases: - Messy data: I have had a lot of data sets with european text in them, mostly ASCII and an occasional non ASCII accented character or symbol -- most of these come from legacy systems, and have an ugly arbitrary combination of MacRoman, Win-something-or-other, and who knows what -- i.e. mojibake, though at least mostly ascii. The only way to deal with it "properly" is to examine each string and try to figure out which encoding it is in, hope at least a single string is in one encoding, and then decode/encode it properly. So numpy should support that -- which would be handled by a 'bytes' type, just like in Python itself. But sometimes that isn't practical, and still doesn't work 100% -- in which case, we can go with latin-1, and there will be some weird, incorrect characters in there, and that is OK -- we fix them later when QA/QC or users notice it -- really just like a typo. But stripping the non-ascii characters out would be a worse solution. As would "replace", as sometimes it IS the correct symbol!
(european encodings aren't totally incompatible...). And surrogateescape is worse, too -- any "weird" character is the same to my users, and at least sometimes it will be the right character -- however surrogateescape gets printed, it will never look right. (and can it even be handled by a non-python system?) - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this. Granted, I should probably simply use a proper unicode type for filenames anyway, but sometimes the data comes in already encoded as latin-something. In the end I still see no downside to latin-1 over ascii-only -- only an upside. I don't think that silently (mis)interpreting non-ASCII characters as > latin-1/9 is a good idea, which is why I think it would be a mistake to use > 'latin-1' for text data with unknown encoding. > if it's totally unknown, then yes -- but for totally unknown, bytes is the only reasonable option -- then run chardet or something over it. But "some latin encoding" -- latin-1 is a good choice. I could get behind a data type that compares equal to strings for ASCII > only and allows for *storing* other characters, but making blind > assumptions about characters 128-255 seems like a recipe for disaster. > Imagine text[unknown] as a one character string type, but it supports > .decode() like bytes and every character in the range 128-255 compares for > equality with other characters like NaN -- not even equal to itself. > would this be ascii with surrogateescape?
-- almost, though I think the surrogateescapes would compare equal if they were equal -- which, now that I think about it would be what you want -- why preserve the bytes if they aren't an important part of the data? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 12:45:20 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:45:20 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern wrote: > > My question: What are those non-ASCII characters? How often are they > truly latin-1/9 vs. some other text encoding vs. non-string binary data? > > I don't know that we can reasonably make that accounting relevant. Number > of such characters per byte of text? Number of files with such characters > out of all existing files? > I have a lot of mostly-english text -- usually not latin-1, but usually mostly latin-1 -- the non-ascii characters are a handful of accented characters (usually from spanish, some french), then a few "scientific" characters: the degree symbol, the "micro" symbol. I suspect that this is not an unusual pattern for mostly-english scientific text. If it's non-string binary data, I know it -- and I'd use a bytes type. I have two options -- try to detect the encoding properly or use _something_ and fix it up later. latin-1 is a great choice for the latter option -- most of the text displays fine, and the wrong stuff is untouched, so I can figure it out.
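To make the "fix it up later" trade-off concrete, here is a minimal sketch; the byte string is invented for illustration (mostly ascii, with 0xb5 -- the micro sign in latin-1/latin-9 -- as the one stray high-bit byte):

```python
# Invented sample: mostly-ascii text with one high-bit byte (0xb5,
# the micro sign in latin-1/latin-9).
raw = b"5 \xb5m sample"

# Strict ascii refuses the high-bit byte outright -- the py3 "crash":
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    pass

# latin-1 never fails, displays the ascii part correctly, and
# round-trips the original bytes exactly:
text = raw.decode("latin-1")
assert text == "5 \xb5m sample"       # '\xb5' is the micro sign
assert text.encode("latin-1") == raw  # same bytes back

# ascii + surrogateescape also round-trips, but the stray byte comes
# back as an unprintable lone surrogate instead of a readable symbol:
escaped = raw.decode("ascii", errors="surrogateescape")
assert escaped == "5 \udcb5m sample"
assert escaped.encode("ascii", errors="surrogateescape") == raw
```

Either handler satisfies the "s = arr[i]; arr[i] = s puts the same bytes back" criterion above; they differ only in how the stray byte displays.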
What I can say with assurance is that every time I have decided, as a > developer, to write code that just hardcodes latin-1 for such cases, I have > regretted it. While it's just personal anecdote, I think it's at least > measuring the right thing. :-) > I've had the opposite experience -- so that's two anecdotes :-) If it were, say, shift-jis, then yes using latin-1 would be a bad idea. But not really much worse than any other option other than properly decoding it. In a way, using latin-1 is like the old py2 string -- it can be used as text, even if it has arbitrary non-text garbage in it... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 12:52:06 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 09:52:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: OK -- onto proposals: 1) The default behaviour for numpy arrays of strings is compatible with > Python3's string model: i.e. fully unicode supporting, and with a character > oriented interface. i.e. if you do:: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode string with 4 or less > characters. > > and arr[1] will return a native Python3 string object. > > This is the use-case for "casual" numpy users -- not the folks writing > H5py and the like, or the ones writing Cython bindings to C++ libs. > I see two options here: a) The current 'U' dtype -- fully meets the specs, and is already there. b) Having a pointer-to-a-python string dtype: -I take it that's what Pandas does and people seem happy.
-That would get us variable length strings, and potentially other nifty string-processing. - It would lose the ability to interact at the binary level with other systems -- but do any other systems use UCS-4 anyway? - how would it work with pickle and numpy zip storage? Personally, I'm fine with (a), but (b) seems like it could be a nice addition. As the 'U' type already exists, the choice to add a python-string type is really orthogonal to the rest of this discussion. Note that I think using utf-8 internally to fit his need is a mistake -- it does not match well with the Python string model. That's it for use-case (1) -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From ambrose.li at gmail.com Tue Apr 25 12:57:02 2017 From: ambrose.li at gmail.com (Ambrose LI) Date: Tue, 25 Apr 2017 12:57:02 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: 2017-04-25 12:34 GMT-04:00 Chris Barker : > I am totally euro-centric, but as I understand it, that is the whole point > of the desire for a compact one-byte-per character encoding. If there is a > strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we > should support that. But this all started with "mostly ascii". My take on > that is: But Shift-JIS is not one-byte; it's two-byte (unless you allow only half-width characters and nothing else). :-) In fact legacy CJK encodings are all nominally two-byte (so that the width of a character's internal representation matches that of its visual representation). 
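For what it's worth, Python's codecs make Ambrose's point easy to check (the sample characters here are my own, not from the thread):

```python
# Shift-JIS is one byte per character only for ascii and for the
# half-width katakana range (bytes 0xa1-0xdf); ordinary kanji and
# full-width kana take two bytes each.
assert len("A".encode("shift_jis")) == 1        # ascii
assert len("\uff71".encode("shift_jis")) == 1   # half-width katakana 'ｱ'
assert len("\u6f22".encode("shift_jis")) == 2   # kanji '漢'
```

So a fixed one-byte-per-character dtype simply cannot represent general Shift-JIS text, which is why it is a bad example of a "1-byte encoding".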
> - filenames > > File names are one of the key reasons folks struggled with the python3 data > model (particularly on *nix) and why 'surrogateescape' was added. It's > pretty common to store filenames in with our data, and thus in numpy arrays > -- we need to preserve them exactly and display them mostly right. Again, > euro-centric, but if you are euro-centric, then latin-1 is a good choice for > this. This I don't understand. As far as I can tell non-Western-European filenames are not unusual. If filenames are a reason, even if you're euro-centric (think Eastern Europe, say) I don't see how latin1 is a good choice. Lurker here, and I haven't touched numpy in ages. So I might be blurting out nonsense. -- Ambrose Li // http://o.gniw.ca / http://gniw.ca If you saw this on CE-L: You do not need my permission to quote me, only proper attribution. Always cite your sources, even if you have to anonymize and/or cite it as "personal communication". From chris.barker at noaa.gov Tue Apr 25 13:04:53 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 10:04:53 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: > 2017-04-25 12:34 GMT-04:00 Chris Barker : > > I am totally euro-centric, > > But Shift-JIS is not one-byte; it's two-byte (unless you allow only > half-width characters and nothing else). :-) Bad example then -- are there other non-euro-centric one byte per char encodings worth worrying about? I have no clue :-) > This I don't understand. As far as I can tell non-Western-European > filenames are not unusual. If filenames are a reason, even if you're > euro-centric (think Eastern Europe, say) I don't see how latin1 is a > good choice. > right -- this is the age of Unicode -- Unicode is the correct choice.
But many of us have data in old files that are not proper Unicode -- and that includes filenames. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Apr 25 13:07:55 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 10:07:55 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. > > So I'll try again -- use-case only! we'll keep the possible solutions separate. > > Do we need to write up a NEP for this? it seems we are going a bit in circles, and we really do want to capture the final decision process. > > 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:: ... etc. These aren't use cases but rather requirements. I'm looking for something rather more concrete than that. * HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. 
Compression is supported for fixed-length string arrays but not variable-length string arrays. * FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecific encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small. * pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`. * There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform. * In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 13:02:17 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 25 Apr 2017 10:02:17 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: Now my proposal for the other use cases: 2) There be some way to store mostly ascii-compatible strings in a single > byte-per-character array -- so not to be wasting space for "typical > european-language-oriented data". 
Note: this should ALSO be compatible with > Python's character-oriented string model. i.e. a Python String with length > N will fit into a dtype of size N. > > arr = np.array(("this", "that",), dtype=np.single_byte_string) > > (name TBD) > > and arr[1] would return a python string. > > attempting to put in a String that is not compatible with the encoding would > raise an EncodingError. > > This is also a use-case primarily for "casual" users -- but ones concerned > with the size of the data storage and who know they are using european text. > more detail elsewhere -- but either ascii with surrogateescape or latin-1 are always good options here. I prefer latin-1 (I really see no downside), but others disagree... But then we get to: > 3) dtypes that support storage in particular encodings: > We need utf-8. We may need others. We may need a 1-byte per char compact encoding that isn't close enough to ascii or latin-1 to be useful (say, shift-jis), And I don't think we are going to come to a consensus on what "single" encoding to use for 1-byte-per-char. So really -- going back to Julian's earlier proposal: a dtype with an encoding and a specified "size" in bytes; once defined, numpy would encode/decode to/from python strings "correctly". We might need "null-terminated utf-8" as a special case. That would support all the other use cases. Even the one-byte per char encoding. I'd like to see a clean alias to a latin-1 encoding, but not a big deal. That leaves a couple decisions: - error out or truncate if the passed-in string is too long? - error out or surrogateescape if there are invalid bytes in the data? - error out or something else if there are characters that can't be encoded in the specified encoding. And we still need a proper bytes type: 4) a fixed length bytes dtype -- pretty much what 'S' is now under python > three -- settable from a bytes or bytearray object (or other memoryview?), > and returns a bytes object.
> > You could use astype() to convert between bytes and a specified encoding > with no change in binary representation. This could be used to store any > binary data, including encoded text or anything else. This should map > directly to the Python bytes model -- thus NOT null-terminated. > > This is a little different from 'S' behaviour on py3 -- it appears that > with 'S', if ALL the trailing bytes are null, then it is truncated, but > if there is a null byte in the middle, then it is preserved. I suspect that > this is a legacy from Py2's use of "strings" as both text and binary data. > But in py3, a "bytes" type should be about bytes, and not text, and thus > null-valued bytes are simply another value a byte can hold. > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Tue Apr 25 13:12:54 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 17:12:54 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to nail down first -- as Robert has made > clear. > I would make my use-cases more user-specific: 1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. -> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions are going to be a memory problem.
2) User has to deal with fixed-width binary data from an external program/library and wants to see it as python strings. This may be systematically encoded in a known encoding (e.g. HDF5's fixed-storage-length zero-padded UTF-8 strings, spec-observing FITS' zero-padded ASCII) or ASCII-with-exceptions-and-the-user-is-supposed-to-know (e.g. spec-violating FITS files with zero-padded latin-9, koi8-r, cp1251, or whatever). Length may be signaled by null termination, null padding, or space padding. -> Solvable with a fixed-storage-size encoded-string dtype, as long as it has a parameter for how length is signaled. Python tricks for dealing with wrong or unknown encodings can make bogus data manageable. 3) User has to deal with fixed-width binary data from an external program/library that really is binary bytes. -> Solvable with a dtype that returns fixed-length byte strings. 4) User has a stupendous number (billions) of short strings which are mostly but not entirely ASCII and wants to manipulate them as strings. -> Not sure how to solve this. Maybe an object array with byte strings for storage and encoding information in the dtype, allowing transparent decoding? Or a fixed-storage-size array with a one-byte encoding that can cope with all the characters the user will ever want to use? 5) User has a bunch of mystery-encoding strings(?) and wants to store them in a numpy array. -> If they're python strings already, no further harm is done by treating this as case 1 when in python-land. If they need to be in fixed-width fields for communication with an external program or library, this puts us in case 2, unknown encoding variety; user will have to pick an encoding that the external program is likely to be able to cope with; this may be the one that originated the mystery strings in the first place. 6) User has python strings and wants to store them in non-object numpy arrays for some reason but doesn't care about the actual memory layout. 
-> Solvable with the current setup; fixed-width UCS-4 fields, padded with Unicode NULL. Happily, this comes for free from arbitrary-encoding fixed-storage-size dtypes, though a friendlier interface might be nice. Also allows people to use UCS-2 or ASCII if they know their strings fit. 7) User has data in one binary format and it needs to go into another, with perhaps casual inspection while in python-land. Such data is mostly ASCII but might contain mystery characters; presenting gobbledygook to the user is okay as long as the characters are output intact. -> Reading and writing as a fixed-width one-byte encoding, preferably one resembling the one the data is actually in, should work here. UTF-8 is likely to mangle the data; ASCII-with-surrogateescape might do okay. The key thing here is that both input and output files will have their own ways of specifying string length and their own storage specifiers; user must know these, and someone has to know and specify what to do with strings that don't fit. Simple truncation will mangle UTF-8 if it is not known to be UTF-8, but there's maybe not much that can be done about that. I guess my point is that a use case should specify: * Where does the data come from (i.e. in what format)? * Are there memory constraints in the storage format? * How should access look to the user? In particular, what should misencoded data look like? * Where does the data need to go? Anne -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert.kern at gmail.com Tue Apr 25 13:15:27 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 10:15:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker wrote: > > On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: >> >> 2017-04-25 12:34 GMT-04:00 Chris Barker : >> > I am totally euro-centric, > >> But Shift-JIS is not one-byte; it's two-byte (unless you allow only >> half-width characters and nothing else). :-) > > bad example then -- are their other non-euro-centric one byte per char encodings worth worrying about? I have no clue :-) I've run into Windows-1251 in files (seismic and well log data from Russian wells). Treating them as latin-1 does not make for a happy time. Both encodings also technically derive from ASCII in the lower half, but most of the actual language is written with the high-bit characters. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Tue Apr 25 13:34:37 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 17:34:37 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > * HDF5 supports fixed-length and variable-length string arrays encoded in > ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite > the documentation claiming that there are more options). In practice, the > ASCII strings permit high-bit characters, but the encoding is unspecified. > Memory-mapping is rare (but apparently possible). The two major HDF5 > bindings are waiting for a fixed-length UTF-8 numpy dtype to support that > HDF5 option. 
Compression is supported for fixed-length string arrays but > not variable-length string arrays. > > * FITS supports fixed-length string arrays that are NULL-padded. The > strings do not have a formal encoding, but in practice, they are typically > mostly ASCII characters with the occasional high-bit character from an > unspecific encoding. Memory-mapping is a common practice. These arrays can > be quite large even if each scalar is reasonably small. > > * pandas uses object arrays for flexible in-memory handling of string > columns. Lengths are not fixed, and None is used as a marker for missing > data. String columns must be written to and read from a variety of formats, > including CSV, Excel, and HDF5, some of which are Unicode-aware and work > with `unicode/str` objects instead of `bytes`. > > * There are a number of sometimes-poorly-documented, > often-poorly-adhered-to, aging file format "standards" that include string > arrays but do not specify encodings, or such specification is ignored in > practice. This can make the usual "Unicode sandwich" at the I/O boundaries > difficult to perform. > > * In Python 3 environments, `unicode/str` objects are rather more common, > and simple operations like equality comparisons no longer work between > `bytes` and `unicode/str`, making it difficult to work with numpy string > arrays that yield `bytes` scalars. > It seems the greatest challenge is interacting with binary data from other programs and libraries. If we were living entirely in our own data world, Unicode strings in object arrays would generally be pretty satisfactory. So let's try to get what is needed to read and write other people's formats. I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map directly to numpy arrays; we can store it however we want, as conversion is necessary anyway. Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. 
But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From hodge at stsci.edu Tue Apr 25 13:51:19 2017 From: hodge at stsci.edu (Phil Hodge) Date: Tue, 25 Apr 2017 13:51:19 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On 04/25/2017 01:34 PM, Anne Archibald wrote: > I know they're not numpy-compatible, but FITS header values are > space-padded; does that occur elsewhere? Strings in FITS headers are delimited by single quotes. Some keywords (only a handful) are required to have values that are blank-padded (in the FITS file) if the value is less than eight characters. Whether you get trailing blanks when you read the header depends on the FITS reader. 
I use astropy.io.fits to read/write FITS files, and that interface strips trailing blanks from character strings: TARGPROP= 'UNKNOWN ' / Proposer's name for the target >>> fd = fits.open("test.fits") >>> s = fd[0].header['targprop'] >>> len(s) 7 Phil From peridot.faceted at gmail.com Tue Apr 25 13:54:39 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 17:54:39 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 6:36 PM Chris Barker wrote: > > This is essentially my rant about use-case (2): > > A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be compatible with anything in particular? If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number so that you need a more compact, fixed-stride memory layout? Presumably you're getting byte strings (with no NULLs) from somewhere and need to store them in this memory structure in a way that makes them as usable as possible in spite of their unknown encoding. Presumably the thing to do is just copy them in there as-is and then use .astype to arrange for python to decode them when accessed. So this is precisely the problem of "how should I decode random byte strings?" that python has been struggling with. My impression is that the community has established that there's no one solution that makes everyone happy, but that most people can cope with some combination of picking a one-byte encoding, ascii-with-surrogateescapes, zapping bogus characters, and giving wrong results. But I think that all the standard python alternatives are needed, in general, and in terms of interpreting numpy arrays full of bytes. 
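For reference, the standard alternatives listed above, applied to one invented mystery byte string (mostly ascii, one bogus high-bit byte):

```python
# Invented mystery data: mostly ascii with one bogus byte (0xb0).
mystery = b"temp 25\xb0"

# zap the bogus character:
assert mystery.decode("ascii", "ignore") == "temp 25"
# mark it with U+FFFD:
assert mystery.decode("ascii", "replace") == "temp 25\ufffd"
# pick a one-byte encoding (reads as a degree sign -- right only if
# the data really was latin-1):
assert mystery.decode("latin-1") == "temp 25\xb0"
# preserve the byte for a later round-trip, at the cost of an
# unprintable lone surrogate in the decoded string:
s = mystery.decode("ascii", "surrogateescape")
assert s.encode("ascii", "surrogateescape") == mystery
```

Each option maps onto one of the coping strategies above: zapping, giving (possibly) wrong results, or deferring the decision.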
Clearly your preferred solution is .astype("string[latin-9]"), but just as clearly that's not going to work for everyone. If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays; anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From peridot.faceted at gmail.com Tue Apr 25 14:00:20 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Tue, 25 Apr 2017 18:00:20 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge wrote: > On 04/25/2017 01:34 PM, Anne Archibald wrote: > > I know they're not numpy-compatible, but FITS header values are > > space-padded; does that occur elsewhere? > > Strings in FITS headers are delimited by single quotes. Some keywords > (only a handful) are required to have values that are blank-padded (in > the FITS file) if the value is less than eight characters. Whether you > get trailing blanks when you read the header depends on the FITS > reader. I use astropy.io.fits to read/write FITS files, and that > interface strips trailing blanks from character strings: > > TARGPROP= 'UNKNOWN ' / Proposer's name for the target > > >>> fd = fits.open("test.fits") > >>> s = fd[0].header['targprop'] > >>> len(s) > 7 > Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before they can go into a data structure. 
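As a toy illustration (plain Python, not astropy's actual parser), the blank padding inside the quotes of Phil's example card survives the de-quoting and is then stripped as insignificant per the FITS spec:

```python
# The header card from Phil's example (value padded to eight characters):
card = "TARGPROP= 'UNKNOWN ' / Proposer's name for the target"

# crude de-quoting: take the text between the first pair of single quotes
start = card.index("'") + 1
end = card.index("'", start)
padded = card[start:end]
assert padded == "UNKNOWN "  # FITS-mandated blank padding intact

# trailing spaces are not significant, so a reader strips them:
assert padded.rstrip() == "UNKNOWN"
assert len(padded.rstrip()) == 7  # matches Phil's len(s) == 7
```

The point being that by the time such a value could land in a numpy array, the fixed-width space padding has already been dealt with by the reader.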
What I was wondering was whether people have data lying around with fixed-width fields where the strings are space-padded, so that numpy needs to support that. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Apr 25 14:18:57 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 12:18:57 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald wrote: > > On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > >> * HDF5 supports fixed-length and variable-length string arrays encoded in >> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite >> the documentation claiming that there are more options). In practice, the >> ASCII strings permit high-bit characters, but the encoding is unspecified. >> Memory-mapping is rare (but apparently possible). The two major HDF5 >> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that >> HDF5 option. Compression is supported for fixed-length string arrays but >> not variable-length string arrays. >> >> * FITS supports fixed-length string arrays that are NULL-padded. The >> strings do not have a formal encoding, but in practice, they are typically >> mostly ASCII characters with the occasional high-bit character from an >> unspecific encoding. Memory-mapping is a common practice. These arrays can >> be quite large even if each scalar is reasonably small. >> >> * pandas uses object arrays for flexible in-memory handling of string >> columns. Lengths are not fixed, and None is used as a marker for missing >> data. String columns must be written to and read from a variety of formats, >> including CSV, Excel, and HDF5, some of which are Unicode-aware and work >> with `unicode/str` objects instead of `bytes`. 
>> >> * There are a number of sometimes-poorly-documented, >> often-poorly-adhered-to, aging file format "standards" that include string >> arrays but do not specify encodings, or such specification is ignored in >> practice. This can make the usual "Unicode sandwich" at the I/O boundaries >> difficult to perform. >> >> * In Python 3 environments, `unicode/str` objects are rather more common, >> and simple operations like equality comparisons no longer work between >> `bytes` and `unicode/str`, making it difficult to work with numpy string >> arrays that yield `bytes` scalars. >> > > It seems the greatest challenge is interacting with binary data from other > programs and libraries. If we were living entirely in our own data world, > Unicode strings in object arrays would generally be pretty satisfactory. So > let's try to get what is needed to read and write other people's formats. > > I'll note that this is numpy, so variable-width fields (e.g. CSV) don't > map directly to numpy arrays; we can store it however we want, as > conversion is necessary anyway. > > Clearly there is a need for fixed-storage-size zero-padded UTF-8; two > other packages are waiting specifically for it. But specifying this > requires two pieces of information: What is the encoding? and How is the > length specified? I know they're not numpy-compatible, but FITS header > values are space-padded; does that occur elsewhere? Are there other ways > existing data specifies string length within a fixed-size field? There are > some cryptographic length-specification tricks - ANSI X.293, ISO 10126, > PKCS7, etc. - but they are probably too specialized to need? We should make > sure we can support all the ways that actually occur. > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. For byte strings, it looks like we need a parameterized type. This is for two uses, display and conversion to (Python) unicode. 
One could handle the display and conversion using view and astype methods. For instance, we already have In [1]: a = array([1,2,3], uint8) + 0x30 In [2]: a.view('S1') Out[2]: array(['1', '2', '3'], dtype='|S1') In [3]: a.view('S1').astype('U') Out[3]: array([u'1', u'2', u'3'], dtype='<U1') From wieser.eric+numpy at gmail.com Tue Apr 25 14:46:36 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Tue, 25 Apr 2017 18:46:36 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer allows str(b'123') to do the right thing. Specifically, it seems like astype should always be forbidden to go between unicode and byte arrays - so that would need to be written as: In [1]: a = array([1,2,3], uint8) + 0x30 In [2]: a.view('S1') Out[2]: array(['1', '2', '3'], dtype='|S1') In [3]: a.view('U[ascii]') Out[3]: array([u'1', u'2', u'3'], dtype=' wrote: On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald > wrote: > >> >> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern >> wrote: >> >>> * HDF5 supports fixed-length and variable-length string arrays encoded >>> in ASCII and UTF-8. In all cases, these strings are NULL-terminated >>> (despite the documentation claiming that there are more options). In >>> practice, the ASCII strings permit high-bit characters, but the encoding is >>> unspecified. Memory-mapping is rare (but apparently possible). The two >>> major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to >>> support that HDF5 option. Compression is supported for fixed-length string >>> arrays but not variable-length string arrays. >>> >>> * FITS supports fixed-length string arrays that are NULL-padded.
The >>> strings do not have a formal encoding, but in practice, they are typically >>> mostly ASCII characters with the occasional high-bit character from an >>> unspecific encoding. Memory-mapping is a common practice. These arrays can >>> be quite large even if each scalar is reasonably small. >>> >>> * pandas uses object arrays for flexible in-memory handling of string >>> columns. Lengths are not fixed, and None is used as a marker for missing >>> data. String columns must be written to and read from a variety of formats, >>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work >>> with `unicode/str` objects instead of `bytes`. >>> >>> * There are a number of sometimes-poorly-documented, >>> often-poorly-adhered-to, aging file format "standards" that include string >>> arrays but do not specify encodings, or such specification is ignored in >>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries >>> difficult to perform. >>> >>> * In Python 3 environments, `unicode/str` objects are rather more >>> common, and simple operations like equality comparisons no longer work >>> between `bytes` and `unicode/str`, making it difficult to work with numpy >>> string arrays that yield `bytes` scalars. >>> >> >> It seems the greatest challenge is interacting with binary data from >> other programs and libraries. If we were living entirely in our own data >> world, Unicode strings in object arrays would generally be pretty >> satisfactory. So let's try to get what is needed to read and write other >> people's formats. >> >> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't >> map directly to numpy arrays; we can store it however we want, as >> conversion is necessary anyway. >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two >> other packages are waiting specifically for it. But specifying this >> requires two pieces of information: What is the encoding? 
and How is the >> length specified? I know they're not numpy-compatible, but FITS header >> values are space-padded; does that occur elsewhere? Are there other ways >> existing data specifies string length within a fixed-size field? There are >> some cryptographic length-specification tricks - ANSI X.293, ISO 10126, >> PKCS7, etc. - but they are probably too specialized to need? We should make >> sure we can support all the ways that actually occur. >> > > Agree with the UTF-8 fixed byte length strings, although I would tend > towards null terminated. > > For byte strings, it looks like we need a parameterized type. This is for > two uses, display and conversion to (Python) unicode. One could handle the > display and conversion using view and astype methods. For instance, we > already have > > In [1]: a = array([1,2,3], uint8) + 0x30 > > In [2]: a.view('S1') > Out[2]: > array(['1', '2', '3'], > dtype='|S1') > > In [3]: a.view('S1').astype('U') > Out[3]: > array([u'1', u'2', u'3'], > dtype=' > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Apr 25 14:52:08 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 11:52:08 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.faceted at gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? 
I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. > > > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
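Robert's two definitions can be made concrete with a short sketch. This is plain Python for illustration only; the function names are mine, not a proposed API:

```python
def null_padded(s: bytes, width: int) -> bytes:
    # NULL-padded: all `width` bytes may carry data; any leftover
    # space is filled with NUL bytes.
    if len(s) > width:
        raise ValueError("string does not fit in field")
    return s.ljust(width, b"\x00")


def null_terminated(s: bytes, width: int) -> bytes:
    # NULL-terminated: one byte is always reserved for the terminating
    # NUL, so at most width - 1 bytes may carry data.
    if len(s) > width - 1:
        raise ValueError("string does not fit in field")
    return s.ljust(width, b"\x00")
```

For a width-4 field, null_padded accepts b'abcd' while null_terminated rejects it; both store b'ab' as b'ab\x00\x00'.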
URL: From njs at pobox.com Tue Apr 25 15:29:22 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 12:29:22 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 25, 2017 11:53 AM, "Robert Kern" wrote: On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.faceted at gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. > > > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. 
But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. And to save anyone else having to check, numpy's current NUL-padded dtypes only strip trailing NULs, so they can round-trip strings that contain NULs, just not strings where NUL is the last character. So the set of strings representable by str/bytes is a strict superset of the set of strings representable by numpy U/S dtypes, which in turn is a strict superset of the set of strings representable by a hypothetical NUL-terminated dtype. (Of course this doesn't matter for most practical purposes, because people rarely make strings with embedded NULs.) -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Apr 25 15:30:27 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 13:30:27 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > > > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < > peridot.faceted at gmail.com> wrote: > > >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two > other packages are waiting specifically for it. But specifying this > requires two pieces of information: What is the encoding? and How is the > length specified? I know they're not numpy-compatible, but FITS header > values are space-padded; does that occur elsewhere? Are there other ways > existing data specifies string length within a fixed-size field? There are > some cryptographic length-specification tricks - ANSI X.293, ISO 10126, > PKCS7, etc. 
- but they are probably too specialized to need? We should make > sure we can support all the ways that actually occur. > > > > > > Agree with the UTF-8 fixed byte length strings, although I would tend > towards null terminated. > > Just to clarify some terminology (because it wasn't originally clear to me > until I looked it up in reference to HDF5): > > * "NULL-padded" implies that, for a fixed width of N, there can be up to N > non-NULL bytes. Any extra space left over is padded with NULLs, but no > space needs to be reserved for NULLs. > > * "NULL-terminated" implies that, for a fixed width of N, there can be up > to N-1 non-NULL bytes. There must always be space reserved for the > terminating NULL. > > I'm not really sure if "NULL-padded" also specifies the behavior for > embedded NULLs. It's certainly possible to deal with them: just strip > trailing NULLs and leave any embedded ones alone. But I'm also sure that > there are some implementations somewhere that interpret the requirement as > "stop at the first NULL or the end of the fixed width, whichever comes > first", effectively being NULL-terminated just not requiring the reserved > space. > Thanks for the clarification. NULL-padded is what I meant. I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
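Nathaniel's description of the trailing-NUL behaviour is easy to check against stock numpy's existing 'S' dtype:

```python
import numpy as np

# Embedded NULs survive a round trip through a fixed-width 'S' dtype...
a = np.array([b"ab\x00c"], dtype="S5")
assert a[0] == b"ab\x00c"

# ...but a trailing NUL is indistinguishable from padding and is stripped.
b = np.array([b"abc\x00"], dtype="S5")
assert b[0] == b"abc"
```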
URL: From robert.kern at gmail.com Tue Apr 25 15:36:27 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 12:36:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.harris at gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.faceted at gmail.com> wrote: >> >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. >> > >> > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. >> >> Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): >> >> * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. >> >> * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. >> >> I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. 
It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. > > Thanks for the clarification. NULL-padded is what I meant. Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is NULL-terminated. > I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed. I'm not sure. Some of these fixed-width string arrays are embedded inside structured arrays with other dtypes. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Apr 25 15:37:16 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 12:37:16 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 25, 2017 9:35 AM, "Chris Barker" wrote: - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, euro-centric, but if you are euro-centric, then latin-1 is a good choice for this. Eh... First, on Windows and MacOS, filenames are natively Unicode. So you don't care about preserving the bytes, only the characters. It's only Linux and the other traditional unixes where filenames are natively bytestrings. 
And then from in Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. I'm not seeing how latin1 really helps anything here -- best case you still have to do something like the wsgi "encoding dance" before you could use the filenames. IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-). -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Apr 25 15:38:19 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 13:38:19 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern > wrote: > >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < >> peridot.faceted at gmail.com> wrote: >> >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two >> other packages are waiting specifically for it. But specifying this >> requires two pieces of information: What is the encoding? and How is the >> length specified? I know they're not numpy-compatible, but FITS header >> values are space-padded; does that occur elsewhere? Are there other ways >> existing data specifies string length within a fixed-size field? There are >> some cryptographic length-specification tricks - ANSI X.293, ISO 10126, >> PKCS7, etc. - but they are probably too specialized to need? We should make >> sure we can support all the ways that actually occur. 
>> > >> > >> > Agree with the UTF-8 fixed byte length strings, although I would tend >> towards null terminated. >> >> Just to clarify some terminology (because it wasn't originally clear to >> me until I looked it up in reference to HDF5): >> >> * "NULL-padded" implies that, for a fixed width of N, there can be up to >> N non-NULL bytes. Any extra space left over is padded with NULLs, but no >> space needs to be reserved for NULLs. >> >> * "NULL-terminated" implies that, for a fixed width of N, there can be up >> to N-1 non-NULL bytes. There must always be space reserved for the >> terminating NULL. >> >> I'm not really sure if "NULL-padded" also specifies the behavior for >> embedded NULLs. It's certainly possible to deal with them: just strip >> trailing NULLs and leave any embedded ones alone. But I'm also sure that >> there are some implementations somewhere that interpret the requirement as >> "stop at the first NULL or the end of the fixed width, whichever comes >> first", effectively being NULL-terminated just not requiring the reserved >> space. >> > > Thanks for the clarification. NULL-padded is what I meant. > > I'm wondering how much of the desired functionality we could get by simply > subclassing ndarray in python. I think we mostly want to be able to view > byte strings and convert to unicode if needed. > > And I think the really tricky part is sorting and rich comparison. Unfortunately, the comparison function is currently located in the c structure. I suppose we could define a c wrapper function to go in the slot. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
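Chuck's two points (a thin Python subclass over byte storage, and the sorting/comparison slot) can be combined in a toy sketch. Everything below is hypothetical illustration, not a proposal; the interesting property is that for latin-1, bytewise comparison of the underlying 'S' dtype already matches code-point order, so sorting needs no custom comparison function:

```python
import numpy as np


class Latin1Array(np.ndarray):
    """Toy subclass: latin-1 bytes in an 'S' dtype, str handed back on access."""

    def __new__(cls, strings, width):
        data = [s.encode("latin-1") for s in strings]
        return np.array(data, dtype="S%d" % width).view(cls)

    def __getitem__(self, index):
        out = super().__getitem__(index)
        if isinstance(out, bytes):  # scalar access: decode for the user
            return out.decode("latin-1")
        return out
```

np.sort works unchanged because the bytewise 'S' comparison coincides with latin-1 code-point order: sorting ['b', 'a', 'é'] yields 'a', 'b', 'é'.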
URL: From njs at pobox.com Tue Apr 25 15:52:06 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 12:52:06 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: On Apr 25, 2017 10:13 AM, "Anne Archibald" wrote: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to nail down first -- as Robert has made > clear. > I would make my use-cases more user-specific: 1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. -> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions are going to be a memory problem. It's possible to do much better than this when defining a specialized variable-width string dtype. E.g. make the itemsize 8 bytes (like an object array, assuming a 64 bit system), but then for strings that can be encoded in 7 bytes or less of utf8 store them directly in the array; else store a pointer to a raw utf8 string on the heap. (Possibly with a reference count - there are some interesting tradeoffs there. I suspect 1-byte reference counts might be the way to go; if a logical copy would make it overflow then make an actual copy instead.) Anything involving the heap is going to have some overhead, but we don't need full fledged Python objects and once we give up mmap compatibility then there's a lot of room to tune. -n -------------- next part -------------- An HTML attachment was scrubbed... 
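Nathaniel's small-string trick can be sketched numerically. The 8-byte itemsize and 7-byte inline budget are his example; the function and names are mine, purely for illustration:

```python
ITEMSIZE = 8  # matches the pointer size of an object array on a 64-bit system


def pack(s):
    """Decide where a string's UTF-8 bytes would live under this scheme."""
    raw = s.encode("utf-8")
    if len(raw) <= ITEMSIZE - 1:  # 7 payload bytes + 1 tag/length byte
        return ("inline", raw)    # stored directly in the array element
    return ("heap", raw)          # real code would store a pointer (+ refcount)
```

'numpy' is 5 UTF-8 bytes and stays inline; most short Western text fits, while longer or heavily multibyte strings fall back to the heap.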
URL: From chris.barker at noaa.gov Tue Apr 25 18:47:46 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 25 Apr 2017 15:47:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: <1229716955908306730@unknownmsgid> A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Actually, *I* am not trying to do anything here -- I'm the one that said computers are so big and fast now that we shouldn't whine about 4 bytes for a character....but this whole conversation started with that request...and I have sympathy .. no one likes to waste memory. After all, numpy support small numeric dtypes, too. Do you need your in-memory format for this data to be compatible with anything in particular? Not for this requirement -- binary interchange is another requirement. If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number so that you need a more compact, fixed-stride memory layout? That's the whole point, yes. Object arrays would be a good solution to the full Unicode problem, not the "why am I wasting so much space when all my data are ascii ? Presumably you're getting byte strings (with unknown encoding. No -- thus is for creating and using mostly ascii string data with python and numpy. Unknown encoding bytes belong in byte arrays -- they are not text. I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way to many files like that. Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. 
All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays; Or UCS-4. I think object arrays would be more problematic for npz storage, and raw "tostring" dumping. (And pickle?) not sure how important that is. And it would be good to have something that plays well with recarrays anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type. Exactly. -CHB -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Apr 25 18:50:05 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 25 Apr 2017 15:50:05 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: <5825136600369965739@unknownmsgid> Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before they can go into a data structure. What I was wondering was whether people have data lying around with fixed-width fields where the strings are space-padded, so that numpy needs to support that. I would say whether to strip space-padded strings should be the reader's problem, not numpy's -CHB -------------- next part -------------- An HTML attachment was scrubbed... 
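Chris's "reader's problem" position is already workable with existing tools; a FITS-style space-padded column can be stripped at the reading layer (illustrative values):

```python
import numpy as np

# Fixed-width, space-padded fields as they might come off disk.
raw = np.array([b"NGC 1234  ", b"M31       "], dtype="S10")

# Strip the padding in the reader, not in the dtype itself.
clean = np.char.rstrip(raw, b" ")
```

Only trailing spaces are removed, so the interior space in b'NGC 1234' survives.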
URL: From chris.barker at noaa.gov Tue Apr 25 19:11:27 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 25 Apr 2017 16:11:27 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> Message-ID: <-2179002348619298640@unknownmsgid> > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > Eh... First, on Windows and MacOS, filenames are natively Unicode. Yeah, though once they are stored I. A text file -- who the heck knows? That may be simply unsolvable. > s. And then from in Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. > IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-). I thought the Python file (and Path) APIs all required (Unicode) strings? That was the whole complaint! And no, bytestrings are not perfectly friendly in py3. This got really complicated and sidetracked, but All I'm suggesting is that if we have a 1byte per char string type, with a fixed encoding, that that encoding be Latin-1, rather than ASCII. That's it, really. Having a settable encoding would work fine, too. -CHB From robert.kern at gmail.com Tue Apr 25 19:50:05 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 16:50:05 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <1229716955908306730@unknownmsgid> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < chris.barker at noaa.gov> wrote: >> Presumably you're getting byte strings (with unknown encoding. > > No -- thus is for creating and using mostly ascii string data with python and numpy. 
> > Unknown encoding bytes belong in byte arrays -- they are not text. You are welcome to try to convince Thomas of that. That is the status quo for him, but he is finding that difficult to work with. > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way to many files like that. That sloppiness that you mention is precisely the "unknown encoding" problem. Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-) > Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-) There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
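The trade-off Robert describes is visible in plain Python: latin-1 maps every byte to a code point, so it decodes any input (possibly to the wrong characters), while UTF-8 refuses invalid bytes rather than guessing:

```python
data = b"caf\xe9"  # 'caf\xe9' ('café') if the file really was latin-1

# latin-1 decoding never fails...
assert data.decode("latin-1") == "caf\xe9"

# ...whereas UTF-8 rejects the stray high byte outright.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```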
URL: From njs at pobox.com Tue Apr 25 20:41:22 2017 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 25 Apr 2017 17:41:22 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <-2179002348619298640@unknownmsgid> References: <8741041756854148453@unknownmsgid> <-2179002348619298640@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal wrote: >> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > >> Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored I. A text file -- who the heck > knows? That may be simply unsolvable. >> s. And then from in Python, if you want to actually work with those filenames you need to either have a bytestring type or else a Unicode type that uses surrogateescape to represent the non-ascii characters. > > >> IMO if you have filenames that are arbitrary bytestrings and you need to represent this properly, you should just use bytestrings -- really, they're perfectly friendly :-). > > I thought the Python file (and Path) APIs all required (Unicode) > strings? That was the whole complaint! No, the path APIs all accept bytestrings (and ones that return pathnames like listdir return bytestrings if given bytestrings). Or at least they're supposed to. The really urgent need for surrogateescape was things like sys.argv and os.environ where arbitrary bytes might come in (on some systems) but the API is restricted to strs. > And no, bytestrings are not perfectly friendly in py3. I'm not saying you should use them everywhere or that they remove the need for an ergonomic text dtype, but when you actually want to work with bytes they're pretty good (esp. in modern py3). -n -- Nathaniel J. 
Smith -- https://vorpus.org From charlesr.harris at gmail.com Tue Apr 25 21:27:57 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 25 Apr 2017 19:27:57 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < > chris.barker at noaa.gov> wrote: > > >> Presumably you're getting byte strings (with unknown encoding. > > > > No -- thus is for creating and using mostly ascii string data with > python and numpy. > > > > Unknown encoding bytes belong in byte arrays -- they are not text. > > You are welcome to try to convince Thomas of that. That is the status quo > for him, but he is finding that difficult to work with. > > > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, > with a few extra characters" data. With all the sloppiness over the years, > there are way to many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" > problem. Your previous advocacy has also touched on using latin-1 to decode > existing files with unknown encodings as well. If you want to advocate for > using latin-1 only for the creation of new data, maybe stop talking about > existing files? :-) > > > Note: the primary use-case I have in mind is working with ascii text in > numpy arrays efficiently-- folks have called for that. All I'm saying is > use Latin-1 instead of ascii -- that buys you some useful extra characters. > > For that use case, the alternative in play isn't ASCII, it's UTF-8, which > buys you a whole bunch of useful extra characters. ;-) > > There are several use cases being brought forth here. Some involve file > reading, some involve file writing, and some involve in-memory > manipulation. 
Whatever change we make is going to impinge somehow on all of > the use cases. If all we do is add a latin-1 dtype for people to use to > create new in-memory data, then someone is going to use it to read existing > data in unknown or ambiguous encodings. > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. Note that for terminal display we will want something supported by the system, which is another problem altogether. Let me break the problem down into four categories 1. Storage -- hdf5, .npy, fits, etc. 2. Display -- ? 3. Modification -- editing 4. Parsing -- fits, etc. There is probably no one solution that is optimal for all of those. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Tue Apr 25 21:55:52 2017 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Tue, 25 Apr 2017 21:55:52 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >> wrote: >> >> >> Presumably you're getting byte strings (with unknown encoding. >> > >> > No -- thus is for creating and using mostly ascii string data with >> > python and numpy. >> > >> > Unknown encoding bytes belong in byte arrays -- they are not text. >> >> You are welcome to try to convince Thomas of that. That is the status quo >> for him, but he is finding that difficult to work with. 
>> >> > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, >> > with a few extra characters" data. With all the sloppiness over the years, >> > there are way to many files like that. >> >> That sloppiness that you mention is precisely the "unknown encoding" >> problem. Your previous advocacy has also touched on using latin-1 to decode >> existing files with unknown encodings as well. If you want to advocate for >> using latin-1 only for the creation of new data, maybe stop talking about >> existing files? :-) >> >> > Note: the primary use-case I have in mind is working with ascii text in >> > numpy arrays efficiently-- folks have called for that. All I'm saying is use >> > Latin-1 instead of ascii -- that buys you some useful extra characters. >> >> For that use case, the alternative in play isn't ASCII, it's UTF-8, which >> buys you a whole bunch of useful extra characters. ;-) >> >> There are several use cases being brought forth here. Some involve file >> reading, some involve file writing, and some involve in-memory manipulation. >> Whatever change we make is going to impinge somehow on all of the use cases. >> If all we do is add a latin-1 dtype for people to use to create new >> in-memory data, then someone is going to use it to read existing data in >> unknown or ambiguous encodings. > > > > The maximum length of an UTF-8 character is 4 bytes, so we could use that to > size arrays by character length. The advantage over UTF-32 is that it is > easily compressible, probably by a factor of 4 in many cases. That doesn't > solve the in memory problem, but does have some advantages on disk as well > as making for easy display. We could compress it ourselves after encoding by > truncation. > > Note that for terminal display we will want something supported by the > system, which is another problem altogether. Let me break the problem down > into four categories > > Storage -- hdf5, .npy, fits, etc. > Display -- ? 
> Modification -- editing > Parsing -- fits, etc. > > There is probably no one solution that is optimal for all of those. > > Chuck > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > quoting Julian ''' I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant. ''' I don't quite understand why this discussion goes in a direction of an either one XOR the other dtype. I thought the parameterized 1-byte encoding that Julian mentioned initially sounds useful to me. (I'm not sure I will use it much, but I also don't use float16 ) Josef From aldcroft at head.cfa.harvard.edu Tue Apr 25 22:02:38 2017 From: aldcroft at head.cfa.harvard.edu (Aldcroft, Thomas) Date: Tue, 25 Apr 2017 22:02:38 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <-2179002348619298640@unknownmsgid> References: <8741041756854148453@unknownmsgid> <-2179002348619298640@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal < chris.barker at noaa.gov> wrote: > > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > > > Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored I. A text file -- who the heck > knows? That may be simply unsolvable. > > s. 
And then from in Python, if you want to actually work with those > filenames you need to either have a bytestring type or else a Unicode type > that uses surrogateescape to represent the non-ascii characters. > > > > IMO if you have filenames that are arbitrary bytestrings and you need to > represent this properly, you should just use bytestrings -- really, they're > perfectly friendly :-). > > I thought the Python file (and Path) APIs all required (Unicode) > strings? That was the whole complaint! > > And no, bytestrings are not perfectly friendly in py3. > > This got really complicated and sidetracked, but All I'm suggesting is > that if we have a 1byte per char string type, with a fixed encoding, > that that encoding be Latin-1, rather than ASCII. > > That's it, really. > Fully agreed. > > Having a settable encoding would work fine, too. > Yup. At a simple level, I just want the things that currently work just fine in Py2 to start working in Py3. That includes being able to read / manipulate / compute and write back to legacy binary FITS and HDF5 files that include ASCII-ish text data (not strictly ASCII). Memory mapping such files should be supportable. Swapping type from bytes to a 1-byte char str should be possible without altering data in memory. BTW, I am saying "I want", but this functionality would definitely be welcome in astropy. I wrote a unicode sandwich workaround for the astropy Table class (https://github.com/astropy/astropy/pull/5700) which should be in the next release. It would be way better to have this at a level lower in numpy. - Tom > > -CHB > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
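The zero-copy reinterpretation Tom asks for ("swapping type from bytes to a 1-byte char str... without altering data in memory") is the same mechanism `.view` already provides between equal-width dtypes today. A quick illustrative sketch with current numpy:

```python
import numpy as np

b = np.array([b'abc', b'xyz'], dtype='S3')

# Reinterpret the same buffer as raw bytes -- no copy, no conversion.
# A hypothetical 1-byte text dtype could be viewed the same way.
u = b.view(np.uint8).reshape(2, 3)
print(u)   # [[ 97  98  99]
           #  [120 121 122]]
print(np.shares_memory(b, u))   # True
```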
URL: From robert.kern at gmail.com Wed Apr 26 00:20:46 2017 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 25 Apr 2017 21:20:46 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris wrote: > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Apr 26 01:19:22 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 26 Apr 2017 05:19:22 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < > charlesr.harris at gmail.com> wrote: > > > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many cases. That > doesn't solve the in memory problem, but does have some advantages on disk > as well as making for easy display. We could compress it ourselves after > encoding by truncation. > > The major use case that we have for a UTF-8 array is HDF5, and it > specifies the width in bytes, not Unicode characters. 
> It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, it needs to be one of them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtaylor.debian at googlemail.com Wed Apr 26 05:15:36 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 26 Apr 2017 11:15:36 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> On 26.04.2017 03:55, josef.pktd at gmail.com wrote: > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > wrote: >> >> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >>> wrote: >>> >>>>> Presumably you're getting byte strings (with unknown encoding. >>>> >>>> No -- thus is for creating and using mostly ascii string data with >>>> python and numpy. >>>> >>>> Unknown encoding bytes belong in byte arrays -- they are not text. >>> >>> You are welcome to try to convince Thomas of that. That is the status quo >>> for him, but he is finding that difficult to work with. >>> >>>> I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, >>>> with a few extra characters" data. With all the sloppiness over the years, >>>> there are way to many files like that. >>> >>> That sloppiness that you mention is precisely the "unknown encoding" >>> problem. Your previous advocacy has also touched on using latin-1 to decode >>> existing files with unknown encodings as well. 
If you want to advocate for >>> using latin-1 only for the creation of new data, maybe stop talking about >>> existing files? :-) >>> >>>> Note: the primary use-case I have in mind is working with ascii text in >>>> numpy arrays efficiently-- folks have called for that. All I'm saying is use >>>> Latin-1 instead of ascii -- that buys you some useful extra characters. >>> >>> For that use case, the alternative in play isn't ASCII, it's UTF-8, which >>> buys you a whole bunch of useful extra characters. ;-) >>> >>> There are several use cases being brought forth here. Some involve file >>> reading, some involve file writing, and some involve in-memory manipulation. >>> Whatever change we make is going to impinge somehow on all of the use cases. >>> If all we do is add a latin-1 dtype for people to use to create new >>> in-memory data, then someone is going to use it to read existing data in >>> unknown or ambiguous encodings. >> >> >> >> The maximum length of an UTF-8 character is 4 bytes, so we could use that to >> size arrays by character length. The advantage over UTF-32 is that it is >> easily compressible, probably by a factor of 4 in many cases. That doesn't >> solve the in memory problem, but does have some advantages on disk as well >> as making for easy display. We could compress it ourselves after encoding by >> truncation. >> >> Note that for terminal display we will want something supported by the >> system, which is another problem altogether. Let me break the problem down >> into four categories >> >> Storage -- hdf5, .npy, fits, etc. >> Display -- ? >> Modification -- editing >> Parsing -- fits, etc. >> >> There is probably no one solution that is optimal for all of those. 
>> >> Chuck >> >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > > > quoting Julian > > ''' > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > This allows any codec (including truncated utf8) to be added easily (if > python supports it) and allows sidestepping the debate. > > My main concern is whether it should be a new dtype or modifying the > unicode dtype. Though the backward compatibility argument is strongly in > favour of adding a new dtype that makes the np.unicode type redundant. > ''' > > I don't quite understand why this discussion goes in a direction of an > either one XOR the other dtype. > > I thought the parameterized 1-byte encoding that Julian mentioned > initially sounds useful to me. > > (I'm not sure I will use it much, but I also don't use float16 ) > > Josef Indeed, Most of this discussion is irrelevant to numpy. Numpy only really deals with the in memory storage of strings. And in that it is limited to fixed length strings (in bytes/codepoints). How you get your messy strings into numpy arrays is not very relevant to the discussion of a smaller representation of strings. You couldn't get messy strings into numpy without first sorting it out yourself before, you won't be able to afterwards. Numpy will offer a set of encodings, the user chooses which one is best for the use case and if the user screws it up, it is not numpy's problem. You currently only have a few ways to even construct string arrays: - array construction and loops - genfromtxt (which is again just a loop) - memory mapping which I seriously doubt anyone actually does for the S and U dtype Having a new dtype changes nothing here. 
You still need to create numpy arrays from python strings which are well defined and clean. If you put something in that doesn't encode you get an encoding error. No oddities like surrogate escapes are needed, numpy arrays are not interfaces to operating systems nor does numpy need to _add_ support for historical oddities beyond what it already has. If you want to represent bytes exactly as they came in don't use a text dtype (which includes the S dtype, use i1). Concerning variable sized strings, this is simply not going to happen. Nobody is going to rewrite numpy to support it, especially not just for something as unimportant as strings. Best you are going to get (or better already have) is object arrays. It makes no sense to discuss it unless someone comes up with an actual proposal and the willingness to code it. What is a relevant discussion is whether we really need a more compact but limited representation of text than 4-byte utf32 at all. Its usecase is for the most part just for python3 porting and saving some memory in some ascii heavy cases, e.g. astronomy. It is not that significant anymore as porting to python3 has mostly already happened via the ugly byte workaround and memory saving is probably not as significant in the context of numpy which is already heavy on memory usage. My initial approach was to not add a new dtype but to make unicode parametrizable which would have meant almost no cluttering of numpys internals and keeping the api more or less consistent which would make this a relatively simple addition of minor functionality for people that want it. But adding a completely new partially redundant dtype for this usecase may be a too large change to the api. Having two partially redundant string types may confuse users more than our current status quo of our single string type (U). 
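For reference, the memory gap Julian describes, and the encode-on-assignment error he mentions, are easy to see with the dtypes numpy already ships (illustrative snippet only, nothing from the proposal itself):

```python
import numpy as np

# The 'U' dtype stores UCS-4: four bytes per character.
u = np.array(['spam', 'eggs'], dtype='U4')
print(u.itemsize)        # 16 bytes per element

# The 'S' dtype stores one byte per character, but on Python 3
# its elements come back as bytes, not str.
s = np.array([b'spam', b'eggs'], dtype='S4')
print(s.itemsize)        # 4 bytes per element

# Text that does not encode is rejected at construction time.
try:
    np.array(['caf\xe9'], dtype='S4')
except UnicodeEncodeError:
    print('non-ascii text does not fit an S array')
```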
Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shoot > themselves in the foot with. > But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added later should we deem it a good idea. > cheers, > Julian
-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 845 bytes Desc: OpenPGP digital signature URL:
From peridot.faceted at gmail.com Wed Apr 26 06:27:00 2017 From: peridot.faceted at gmail.com (Anne Archibald) Date: Wed, 26 Apr 2017 10:27:00 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: > On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > >> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >> > The maximum length of an UTF-8 character is 4 bytes, so we could use >> that to size arrays by character length. The advantage over UTF-32 is that >> it is easily compressible, probably by a factor of 4 in many cases. That >> doesn't solve the in memory problem, but does have some advantages on disk >> as well as making for easy display. We could compress it ourselves after >> encoding by truncation. >> >> The major use case that we have for a UTF-8 array is HDF5, and it >> specifies the width in bytes, not Unicode characters. >> > > It's not just HDF5. Counting bytes is the Right Way to measure the size of > UTF-8 encoded text: > http://utf8everywhere.org/#myths > > I also firmly believe (though clearly this is not universally agreed upon) > that UTF-8 is the Right Way to encode strings for *non-legacy* > applications. So if we're adding any new string encodings, it needs to be > one of them.
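The one-to-four-byte spread at issue here (characters versus bytes in UTF-8) is easy to verify directly in Python:

```python
# UTF-8 is variable-width: one to four bytes per code point, so the
# byte length of encoded text differs from its character count.
for ch in ['A', '\xe9', '\u20ac', '\U0001f600']:   # A, e-acute, euro, emoji
    print(repr(ch), len(ch), 'char ->', len(ch.encode('utf-8')), 'byte(s)')
```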
> It seems to me that most of the requirements people have expressed in this thread would be satisfied by: (1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.) (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case. (3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense. There seems to be considerable debate about what the "default" string type should be, but since users must specify a length anyway, might as well force them to specify an encoding and thus dodge the debate about the right default. The other question - which I realize is how the thread started - is what to do about backward compatibility. I'm not writing the code, so my opinion doesn't matter much, but I think we're stuck maintaining what we have now - ASCII and UCS4 strings - for a while yet. But it can be deprecated, or they can be simply reimplemented as shorthand names for ASCII- or UCS4-encoded strings in the bytes-with-encoding dtype. Anne -------------- next part -------------- An HTML attachment was scrubbed... 
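Anne's point (2) mentions "a function to truncate encoded data without mangling the encoding". A minimal sketch of what such a helper could look like for UTF-8 — the name `truncate_utf8` is hypothetical, not an existing numpy or stdlib API, and it assumes valid UTF-8 input:

```python
def truncate_utf8(data: bytes, nbytes: int) -> bytes:
    """Truncate valid UTF-8 to at most nbytes without leaving a
    dangling partial code point at the end."""
    # errors='ignore' silently drops the incomplete trailing
    # sequence (if any) that the byte-slice may have created.
    return data[:nbytes].decode('utf-8', 'ignore').encode('utf-8')

s = 'abc\xe9'.encode('utf-8')   # b'abc\xc3\xa9', 5 bytes
print(truncate_utf8(s, 4))      # b'abc' -- not the mangled b'abc\xc3'
print(truncate_utf8(s, 5))      # b'abc\xc3\xa9', untouched
```

"Keep graphemes" truncation, as Eric notes later in the thread, would be harder: it needs Unicode segmentation, not just code-point boundaries.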
URL: From charlesr.harris at gmail.com Wed Apr 26 11:19:13 2017 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 26 Apr 2017 09:19:13 -0600 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > On 26.04.2017 03:55, josef.pktd at gmail.com wrote: > > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > > wrote: > >> > >> > >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern > wrote: > >>> > >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal > >>> wrote: > >>> > >>>>> Presumably you're getting byte strings (with unknown encoding. > >>>> > >>>> No -- thus is for creating and using mostly ascii string data with > >>>> python and numpy. > >>>> > >>>> Unknown encoding bytes belong in byte arrays -- they are not text. > >>> > >>> You are welcome to try to convince Thomas of that. That is the status > quo > >>> for him, but he is finding that difficult to work with. > >>> > >>>> I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, > >>>> with a few extra characters" data. With all the sloppiness over the > years, > >>>> there are way to many files like that. > >>> > >>> That sloppiness that you mention is precisely the "unknown encoding" > >>> problem. Your previous advocacy has also touched on using latin-1 to > decode > >>> existing files with unknown encodings as well. If you want to advocate > for > >>> using latin-1 only for the creation of new data, maybe stop talking > about > >>> existing files? :-) > >>> > >>>> Note: the primary use-case I have in mind is working with ascii text > in > >>>> numpy arrays efficiently-- folks have called for that. 
All I'm saying > is use > >>>> Latin-1 instead of ascii -- that buys you some useful extra > characters. > >>> > >>> For that use case, the alternative in play isn't ASCII, it's UTF-8, > which > >>> buys you a whole bunch of useful extra characters. ;-) > >>> > >>> There are several use cases being brought forth here. Some involve file > >>> reading, some involve file writing, and some involve in-memory > manipulation. > >>> Whatever change we make is going to impinge somehow on all of the use > cases. > >>> If all we do is add a latin-1 dtype for people to use to create new > >>> in-memory data, then someone is going to use it to read existing data > in > >>> unknown or ambiguous encodings. > >> > >> > >> > >> The maximum length of an UTF-8 character is 4 bytes, so we could use > that to > >> size arrays by character length. The advantage over UTF-32 is that it is > >> easily compressible, probably by a factor of 4 in many cases. That > doesn't > >> solve the in memory problem, but does have some advantages on disk as > well > >> as making for easy display. We could compress it ourselves after > encoding by > >> truncation. > >> > >> Note that for terminal display we will want something supported by the > >> system, which is another problem altogether. Let me break the problem > down > >> into four categories > >> > >> Storage -- hdf5, .npy, fits, etc. > >> Display -- ? > >> Modification -- editing > >> Parsing -- fits, etc. > >> > >> There is probably no one solution that is optimal for all of those. > >> > >> Chuck > >> > >> > >> > >> _______________________________________________ > >> NumPy-Discussion mailing list > >> NumPy-Discussion at python.org > >> https://mail.python.org/mailman/listinfo/numpy-discussion > >> > > > > > > quoting Julian > > > > ''' > > I probably have formulated my goal with the proposal a bit better, I am > > not very interested in a repetition of which encoding to use debate. 
> > In the end what will be done allows any encoding via a dtype with > > metadata like datetime. > > This allows any codec (including truncated utf8) to be added easily (if > > python supports it) and allows sidestepping the debate. > > > > My main concern is whether it should be a new dtype or modifying the > > unicode dtype. Though the backward compatibility argument is strongly in > > favour of adding a new dtype that makes the np.unicode type redundant. > > ''' > > > > I don't quite understand why this discussion goes in a direction of an > > either one XOR the other dtype. > > > > I thought the parameterized 1-byte encoding that Julian mentioned > > initially sounds useful to me. > > > > (I'm not sure I will use it much, but I also don't use float16 ) > > > > Josef > > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you get your messy strings into numpy arrays is not very relevant to > the discussion of a smaller representation of strings. > You couldn't get messy strings into numpy without first sorting it out > yourself before, you won't be able to afterwards. > Numpy will offer a set of encodings, the user chooses which one is best > for the use case and if the user screws it up, it is not numpy's problem. > > You currently only have a few ways to even construct string arrays: > - array construction and loops > - genfromtxt (which is again just a loop) > - memory mapping which I seriously doubt anyone actually does for the S > and U dtype > > Having a new dtype changes nothing here. You still need to create numpy > arrays from python strings which are well defined and clean. > If you put something in that doesn't encode you get an encoding error. 
> No oddities like surrogate escapes are needed, numpy arrays are not > interfaces to operating systems nor does numpy need to _add_ support for > historical oddities beyond what it already has. > If you want to represent bytes exactly as they came in don't use a text > dtype (which includes the S dtype, use i1). > > Concerning variable sized strings, this is simply not going to happen. > Nobody is going to rewrite numpy to support it, especially not just for > something as unimportant as strings. > Best you are going to get (or better already have) is object arrays. It > makes no sense to discuss it unless someone comes up with an actual > proposal and the willingness to code it. > > > What is a relevant discussion is whether we really need a more compact > but limited representation of text than 4-byte utf32 at all. > Its usecase is for the most part just for python3 porting and saving > some memory in some ascii heavy cases, e.g. astronomy. > It is not that significant anymore as porting to python3 has mostly > already happened via the ugly byte workaround and memory saving is > probably not as significant in the context of numpy which is already > heavy on memory usage. > > My initial approach was to not add a new dtype but to make unicode > parametrizable which would have meant almost no cluttering of numpys > internals and keeping the api more or less consistent which would make > this a relatively simple addition of minor functionality for people that > want it. > But adding a completely new partially redundant dtype for this usecase > may be a too large change to the api. Having two partially redundant > string types may confuse users more than our current status quo of our > single string type (U). > > Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shot > themselves in the foot with. 
> But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added latter should we deem it a good idea. > I think we can implement viewers for strings as ndarray subclasses. Then one could do `my_string_array.view(latin_1)`, and so on. Essentially that just changes the default encoding of the 'S' array. That could also work for uint8 arrays if needed. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Wed Apr 26 11:39:46 2017 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Wed, 26 Apr 2017 16:39:46 +0100 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: > I think we can implement viewers for strings as ndarray subclasses. Then one > could > do `my_string_array.view(latin_1)`, and so on. Essentially that just > changes the default > encoding of the 'S' array. That could also work for uint8 arrays if needed. > > Chuck To handle structured data-types containing encoded strings, we'd also need to subclass `np.void`. Things would get messy when a structured dtype contains two strings in different encodings (or more likely, one bytestring and one textstring) - we'd need some way to specify which fields are in which encoding, and using subclasses means that this can't be contained within the dtype information. So I think there's a strong argument for solving this with`dtype`s rather than subclasses. This really doesn't seem hard though. 
Something like (C-but-as-python):

    def ENCSTRING_getitem(ptr, arr):  # the PyArrFuncs slot
        encoded = STRING_getitem(ptr, arr)
        return encoded.decode(arr.dtype.encoding)

    def ENCSTRING_setitem(val, ptr, arr):  # the PyArrFuncs slot
        val = val.encode(arr.dtype.encoding)
        # todo: handle "safe" truncation, where safe might mean keep
        # codepoints, keep graphemes, or never allow
        STRING_setitem(val, ptr, arr)

We'd probably need to be careful to do a decode/encode dance when copying from one encoding to another, but we [already have bugs](https://github.com/numpy/numpy/issues/3258) in those cases anyway. Is it reasonable that the user of such an array would want to work with plain `builtin.unicode` objects, rather than some special numpy scalar type? Eric
From chris.barker at noaa.gov Wed Apr 26 12:28:48 2017 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Wed, 26 Apr 2017 09:28:48 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: <-5378706506035339722@unknownmsgid> > > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" problem. Exactly -- but from a practicality-beats-purity perspective, there is a difference between "I have no idea whatsoever" and "I know it is mostly ascii, and European, but there are some extra characters in there". Latin-1 has proven very useful for that case. I suppose in most cases ascii with errors='replace' would be a good choice, but I'd still rather not throw out potentially useful information. > Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well.
If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-) Yeah, I've been very unfocused in this discussion -- sorry about that. > > Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently-- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. > > For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-) UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy to access, and particularly defaults, numpy string dtypes should match it. It's become clear in this discussion that there is s strong desire to support a numpy dtype that stores text in particular binary formats (I.e. Encodings). Rather than choose one or two, we might as well support all encodings supported by python. In that case, we'll have utf-8 for those that know they want that, and we'll have latin-1 for those that incorrectly think they want that :-) So what remains is to decide is implementation, syntax, and defaults. Let's keep in mind that most of us on this list, and in this discussion, are the folks that write interface code and the like. But most numpy users are not as tuned in to the internals. So defaults should be set to best support the more "naive" user. > . If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings. If we add every encoding known to man someone is going to use Latin-1 to read unknown encodings. Indeed, as we've all pointed out, there is no correct encoding with which to read unknown encodings. 
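A small illustration of the trade-off Chris describes, using plain Python codecs on a "mostly ascii" byte string (latin-1 maps every byte so it never fails, ascii-with-replace discards information, and strict UTF-8 simply rejects the data):

```python
raw = b'caf\xe9 r\xe9sum\xe9'          # latin-1 bytes, mostly ascii

print(raw.decode('latin-1'))           # 'café résumé' -- never fails
print(raw.decode('ascii', 'replace'))  # replacement chars, info lost
try:
    raw.decode('utf-8')                # strict utf-8 rejects it outright
except UnicodeDecodeError as e:
    print('utf-8:', e.reason)
```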
Frankly, if we have UTF-8 under the hood, I think people are even MORE likely to use it inappropriately-- it's quite scary how many people think UTF-8 == Unicode, and think all you need to do is "use utf-8", and you don't need to change any of the rest of your code. Oh, and once you've done that, you can use your existing ASCII-only tests and think you have a working application :-) -CHB From robert.kern at gmail.com Wed Apr 26 13:08:01 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 10:08:01 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you get your messy strings into numpy arrays is not very relevant to > the discussion of a smaller representation of strings. > You couldn't get messy strings into numpy without first sorting it out > yourself before, you won't be able to afterwards. > Numpy will offer a set of encodings, the user chooses which one is best > for the use case and if the user screws it up, it is not numpy's problem. > > You currently only have a few ways to even construct string arrays: > - array construction and loops > - genfromtxt (which is again just a loop) > - memory mapping which I seriously doubt anyone actually does for the S > and U dtype I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays. 
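As a concrete instance of the memory-mapping pattern at issue, the existing fixed-width S dtype can be mapped directly with `np.memmap`; a minimal sketch (the file name and record contents are invented for illustration):

```python
import os
import tempfile

import numpy as np

# Write a file of fixed-width 8-byte records, then map it read-only
# without loading it into memory up front.
path = os.path.join(tempfile.mkdtemp(), "names.bin")
np.array([b"alice", b"bob", b"carol"], dtype="S8").tofile(path)

names = np.memmap(path, dtype="S8", mode="r")
assert names.shape == (3,)
assert names[1] == b"bob"  # trailing NUL padding is stripped on access
```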
You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays. Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am. > Having a new dtype changes nothing here. You still need to create numpy > arrays from python strings which are well defined and clean. > If you put something in that doesn't encode you get an encoding error. > No oddities like surrogate escapes are needed, numpy arrays are not > interfaces to operating systems nor does numpy need to _add_ support for > historical oddities beyond what it already has. > If you want to represent bytes exactly as they came in don't use a text > dtype (which includes the S dtype, use i1). Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities. > Concerning variable sized strings, this is simply not going to happen. > Nobody is going to rewrite numpy to support it, especially not just for > something as unimportant as strings. > Best you are going to get (or better already have) is object arrays. It > makes no sense to discuss it unless someone comes up with an actual > proposal and the willingness to code it. No one has suggested such a thing. At most, we've talked about specializing object arrays. > What is a relevant discussion is whether we really need a more compact > but limited representation of text than 4-byte utf32 at all. > Its usecase is for the most part just for python3 porting and saving > some memory in some ascii heavy cases, e.g. astronomy. 
> It is not that significant anymore as porting to python3 has mostly > already happened via the ugly byte workaround and memory saving is > probably not as significant in the context of numpy which is already > heavy on memory usage. > > My initial approach was to not add a new dtype but to make unicode > parametrizable which would have meant almost no cluttering of numpys > internals and keeping the api more or less consistent which would make > this a relatively simple addition of minor functionality for people that > want it. > But adding a completely new partially redundant dtype for this usecase > may be a too large change to the api. Having two partially redundant > string types may confuse users more than our current status quo of our > single string type (U). > > Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shot > themselves in the foot with. > But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added latter should we deem it a good idea. What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date? If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtaylor.debian at googlemail.com Wed Apr 26 13:43:47 2017 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Wed, 26 Apr 2017 19:43:47 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On 26.04.2017 19:08, Robert Kern wrote: > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > > wrote: > >> Indeed, >> Most of this discussion is irrelevant to numpy. >> Numpy only really deals with the in memory storage of strings. And in >> that it is limited to fixed length strings (in bytes/codepoints). >> How you get your messy strings into numpy arrays is not very relevant to >> the discussion of a smaller representation of strings. >> You couldn't get messy strings into numpy without first sorting it out >> yourself before, you won't be able to afterwards. >> Numpy will offer a set of encodings, the user chooses which one is best >> for the use case and if the user screws it up, it is not numpy's problem. >> >> You currently only have a few ways to even construct string arrays: >> - array construction and loops >> - genfromtxt (which is again just a loop) >> - memory mapping which I seriously doubt anyone actually does for the S >> and U dtype > > I fear that you decided that the discussion was irrelevant and thus did > not read it rather than reading it to decide that it was not relevant. > Because several of us have showed that, yes indeed, we do memory-map > string arrays. > > You can add to this list C APIs, like that of libhdf5, that need to > communicate (Unicode) string arrays. > > Look, I know I can be tedious, but *please* go back and read this > discussion. We have concrete use cases outlined. We can give you more > details if you need them. We all feel the pain of the rushed, inadequate > implementation of the U dtype. 
> But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.

I have read every mail and it has been a large waste of time; everything has been said already many times in the last few years. Even if you memory-map string arrays -- of which I have not seen a concrete use case in the mails beyond "would be nice to have" without any backing in actual code, though I may have missed it -- it is in any case still irrelevant. My proposal only _adds_ additional cases that can be mmapped. It does not prevent you from doing what you have been doing before.

> >> Having a new dtype changes nothing here. You still need to create numpy >> arrays from python strings which are well defined and clean. >> If you put something in that doesn't encode you get an encoding error. >> No oddities like surrogate escapes are needed, numpy arrays are not >> interfaces to operating systems nor does numpy need to _add_ support for >> historical oddities beyond what it already has. >> If you want to represent bytes exactly as they came in don't use a text >> dtype (which includes the S dtype, use i1). > > Thomas Aldcroft has demonstrated the problem with this approach. numpy > arrays are often interfaces to files that have tons of historical oddities.

This does not matter for numpy: the text dtype is well defined as bytes with a specific encoding and null padding. If you have an historical oddity that does not fit, do not use the text dtype but use a pure byte array instead.

> >> Concerning variable sized strings, this is simply not going to happen. >> Nobody is going to rewrite numpy to support it, especially not just for >> something as unimportant as strings. >> Best you are going to get (or better already have) is object arrays. It >> makes no sense to discuss it unless someone comes up with an actual >> proposal and the willingness to code it. > > No one has suggested such a thing. At most, we've talked about > specializing object arrays.
> >> What is a relevant discussion is whether we really need a more compact >> but limited representation of text than 4-byte utf32 at all. >> Its usecase is for the most part just for python3 porting and saving >> some memory in some ascii heavy cases, e.g. astronomy. >> It is not that significant anymore as porting to python3 has mostly >> already happened via the ugly byte workaround and memory saving is >> probably not as significant in the context of numpy which is already >> heavy on memory usage. >> >> My initial approach was to not add a new dtype but to make unicode >> parametrizable which would have meant almost no cluttering of numpys >> internals and keeping the api more or less consistent which would make >> this a relatively simple addition of minor functionality for people that >> want it. >> But adding a completely new partially redundant dtype for this usecase >> may be a too large change to the api. Having two partially redundant >> string types may confuse users more than our current status quo of our >> single string type (U). >> >> Discussing whether we want to support truncated utf8 has some merit as >> it is a decision whether to give the users an even larger gun to shot >> themselves in the foot with. >> But I'd like to focus first on the 1 byte type to add a symmetric API >> for python2 and python3. >> utf8 can always be added latter should we deem it a good idea. > > What is your current proposal? A string dtype parameterized with the > encoding (initially supporting the latin-1 that you desire and maybe > adding utf-8 later)? Or a latin-1-specific dtype such that we will have > to add a second utf-8 dtype at a later date? My proposal is a single new parameterizable dtype. Adding multiple dtypes for each encoding seems unnecessary to me given that numpy already supports parameterizable types. For example datetime is very similar, it is basically encoded integers. There are multiple encodings = units supported. 
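The datetime analogy can be checked against the existing API: datetime64 is a single dtype kind parameterized by a unit, not a separate dtype per unit, which is the shape a parameterized text dtype would presumably take with an encoding in the unit's place:

```python
import numpy as np

# datetime64 is one dtype kind parameterized by a time unit
ms = np.dtype("datetime64[ms]")
ns = np.dtype("datetime64[ns]")

assert ms.kind == ns.kind == "M"  # same kind of dtype...
assert ms != ns                   # ...different parameter
```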
> > If you're not going to support arbitrary encodings right off the bat, > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first > as they seem to knock off more use cases straight away.

Please list the use cases in the context of numpy usage. hdf5 is the most obvious, but how exactly would hdf5 use a utf8 array in the actual implementation? What you save by having utf8 in the numpy array is replacing a decoding and encoding step with a null-padding-stripping step. That doesn't seem very worthwhile compared to all the other overheads involved.

From robert.kern at gmail.com Wed Apr 26 13:45:20 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 10:45:20 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID:

On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald wrote: > > On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: >> >> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < charlesr.harris at gmail.com> wrote: >>> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. >>> >>> The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. >> >> It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: >> http://utf8everywhere.org/#myths >> >> I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications.
So if we're adding any new string encodings, it needs to be one of them. > > It seems to me that most of the requirements people have expressed in this thread would be satisfied by: > > (1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.) > > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case. > > (3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense. The void dtype is already there for this general purpose and mostly works, with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the scalars (fortunately, they do not appear to be mutable views). It also accepts `bytes` strings that are too short (pads with NULs) and too long (truncates). If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Wed Apr 26 14:31:00 2017 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 26 Apr 2017 11:31:00 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <-5378706506035339722@unknownmsgid> References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID:

On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" < chris.barker at noaa.gov> wrote:

UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much longer rant about that earlier. So I think the easy to access, and particularly defaults, numpy string dtypes should match it.

This seems a little vague? The "character-oriented Python text model" is just that str supports O(1) indexing of characters. But... Numpy doesn't. If you want to access individual characters inside a string inside an array, you have to pull out the scalar first, at which point the data is copied and boxed into a Python object anyway, using whatever representation the interpreter prefers. So AFAICT it makes literally no difference to the user whether numpy's internal representation allows for fast character access.

-n -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Wed Apr 26 15:03:33 2017 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Wed, 26 Apr 2017 15:03:33 -0400 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID:

On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith wrote: > On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" > wrote: > > > UTF-8 does not match the character-oriented Python text model.
Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the model python 3 has chosen. I wrote a much > longer rant about that earlier. > > So I think the easy to access, and particularly defaults, numpy string > dtypes should match it. > > > This seems a little vague? The "character-oriented Python text model" is > just that str supports O(1) indexing of characters. But... Numpy doesn't. If > you want to access individual characters inside a string inside an array, > you have to pull out the scalar first, at which point the data is copied and > boxed into a Python object anyway, using whatever representation the > interpreter prefers. So AFAICT it makes literally no difference to the user > whether numpy's internal representation allows for fast character access.

you can create a view on individual characters or bytes, AFAICS

>>> t = np.array(['abcdefg']*10)
>>> t2 = t.view([('s%d' % i, '<U1') for i in range(7)])
>>> t2['s5']
array(['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'], dtype='<U1')
>>> t.view('<U1')

> -n > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion >

From sebastian at sipsolutions.net Wed Apr 26 14:38:09 2017 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 26 Apr 2017 20:38:09 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: <1493231889.17161.3.camel@sipsolutions.net>

On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote: > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > wrote: > > > > > Indeed, > > > Most of this discussion is irrelevant to numpy. > > > Numpy only really deals with the in memory storage of strings.
> > > And in > > > that it is limited to fixed length strings (in bytes/codepoints). > > > How you get your messy strings into numpy arrays is not very > > > relevant to > > > the discussion of a smaller representation of strings. > > > You couldn't get messy strings into numpy without first sorting > > > it out > > > yourself before, you won't be able to afterwards. > > > Numpy will offer a set of encodings, the user chooses which one > > > is best > > > for the use case and if the user screws it up, it is not numpy's > > > problem. > > > > > > You currently only have a few ways to even construct string > > > arrays: > > > - array construction and loops > > > - genfromtxt (which is again just a loop) > > > - memory mapping which I seriously doubt anyone actually does for > > > the S > > > and U dtype > > > > I fear that you decided that the discussion was irrelevant and thus > > did > > not read it rather than reading it to decide that it was not > > relevant. > > Because several of us have showed that, yes indeed, we do memory- > > map > > string arrays. > > > > You can add to this list C APIs, like that of libhdf5, that need to > > communicate (Unicode) string arrays. > > > > Look, I know I can be tedious, but *please* go back and read this > > discussion. We have concrete use cases outlined. We can give you > > more > > details if you need them. We all feel the pain of the rushed, > > inadequate > > implementation of the U dtype. But each of our pains is a little > > bit > > different; you obviously aren't experiencing the same pains that I > > am. > > I have read every mail and it has been a large waste of time, > Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it. > In any case it is still irrelevant. 
My proposal only _adds_ > additional > cases that can be mmapped. It does not prevent you from doing what > you > have been doing before. > > > > > > Having a new dtype changes nothing here. You still need to create > > > numpy > > > arrays from python strings which are well defined and clean. > > > If you put something in that doesn't encode you get an encoding > > > error. > > > No oddities like surrogate escapes are needed, numpy arrays are > > > not > > > interfaces to operating systems nor does numpy need to _add_ > > > support for > > > historical oddities beyond what it already has. > > > If you want to represent bytes exactly as they came in don't use > > > a text > > > dtype (which includes the S dtype, use i1). > > > > Thomas Aldcroft has demonstrated the problem with this approach. > > numpy > > arrays are often interfaces to files that have tons of historical > > oddities. > > This does not matter for numpy, the text dtype is well defined as > bytes > with a specific encoding and null padding. If you have an historical > oddity that does not fit, do not use the text dtype but use a pure > byte > array instead. > > > > > > Concerning variable sized strings, this is simply not going to > > > happen. > > > Nobody is going to rewrite numpy to support it, especially not > > > just for > > > something as unimportant as strings. > > > Best you are going to get (or better already have) is object > > > arrays. It > > > makes no sense to discuss it unless someone comes up with an > > > actual > > > proposal and the willingness to code it. > > > > No one has suggested such a thing. At most, we've talked about > > specializing object arrays. > > > > > What is a relevant discussion is whether we really need a more > > > compact > > > but limited representation of text than 4-byte utf32 at all. > > > Its usecase is for the most part just for python3 porting and > > > saving > > > some memory in some ascii heavy cases, e.g. astronomy. 
> > > It is not that significant anymore as porting to python3 has > > > mostly > > > already happened via the ugly byte workaround and memory saving > > > is > > > probably not as significant in the context of numpy which is > > > already > > > heavy on memory usage. > > > > > > My initial approach was to not add a new dtype but to make > > > unicode > > > parametrizable which would have meant almost no cluttering of > > > numpys > > > internals and keeping the api more or less consistent which would > > > make > > > this a relatively simple addition of minor functionality for > > > people that > > > want it. > > > But adding a completely new partially redundant dtype for this > > > usecase > > > may be a too large change to the api. Having two partially > > > redundant > > > string types may confuse users more than our current status quo > > > of our > > > single string type (U). > > > > > > Discussing whether we want to support truncated utf8 has some > > > merit as > > > it is a decision whether to give the users an even larger gun to > > > shot > > > themselves in the foot with. > > > But I'd like to focus first on the 1 byte type to add a symmetric > > > API > > > for python2 and python3. > > > utf8 can always be added latter should we deem it a good idea. > > > > What is your current proposal? A string dtype parameterized with > > the > > encoding (initially supporting the latin-1 that you desire and > > maybe > > adding utf-8 later)? Or a latin-1-specific dtype such that we will > > have > > to add a second utf-8 dtype at a later date? > > My proposal is a single new parameterizable dtype. Adding multiple > dtypes for each encoding seems unnecessary to me given that numpy > already supports parameterizable types. > For example datetime is very similar, it is basically encoded > integers. > There are multiple encodings = units supported. 
> > > > If you're not going to support arbitrary encodings right off the > > bat, > > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape > > first > > as they seem to knock off more use cases straight away. > > > > > Please list the use cases in the context of numpy usage. hdf5 is the > most obvious, but how exactly would hdf5 use a utf8 array in the > actual > implementation? > > What you save by having utf8 in the numpy array is replacing a > decoding > and encoding step with a stripping null padding step. > That doesn't seem very worthwhile compared to all the other > overheads > involved.

I remember talking with a colleague about something like that. And basically an annoying thing there was that if you strip the zero bytes in a zero padded string, some encodings (UTF16) may need one of the zero bytes to work right. (I think she got around it by weird trickery, inverting the endianness or so and thus putting the zero bytes first.) Maybe I will ask her if this discussion is interesting to her. Though I think it might have been something like "make everything in hdf5/something similar work" without any actual use case, I don't know.

I have not read the thread, but I think a fixed byte but settable encoding type would make sense. I personally wonder whether storing the length might make sense, even if that removes direct memory mapping, but as you said, you can still memmap the bytes, and then probably just cast back and forth. Sorry if there is zero actual input here :)

- Sebastian

> _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: This is a digitally signed message part URL: From robert.kern at gmail.com Wed Apr 26 15:07:47 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 12:07:47 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> Message-ID: On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: > > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > > > > wrote: > > > >> Indeed, > >> Most of this discussion is irrelevant to numpy. > >> Numpy only really deals with the in memory storage of strings. And in > >> that it is limited to fixed length strings (in bytes/codepoints). > >> How you get your messy strings into numpy arrays is not very relevant to > >> the discussion of a smaller representation of strings. > >> You couldn't get messy strings into numpy without first sorting it out > >> yourself before, you won't be able to afterwards. > >> Numpy will offer a set of encodings, the user chooses which one is best > >> for the use case and if the user screws it up, it is not numpy's problem. > >> > >> You currently only have a few ways to even construct string arrays: > >> - array construction and loops > >> - genfromtxt (which is again just a loop) > >> - memory mapping which I seriously doubt anyone actually does for the S > >> and U dtype > > > > I fear that you decided that the discussion was irrelevant and thus did > > not read it rather than reading it to decide that it was not relevant. > > Because several of us have showed that, yes indeed, we do memory-map > > string arrays. > > > > You can add to this list C APIs, like that of libhdf5, that need to > > communicate (Unicode) string arrays. 
> > Look, I know I can be tedious, but *please* go back and read this > discussion. We have concrete use cases outlined. We can give you more > details if you need them. We all feel the pain of the rushed, inadequate > implementation of the U dtype. But each of our pains is a little bit > different; you obviously aren't experiencing the same pains that I am. > > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, but I can assure you that I have memory mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support".

> In any case it is still irrelevant. My proposal only _adds_ additional > cases that can be mmapped. It does not prevent you from doing what you > have been doing before.

You are the one who keeps worrying about the additional complexity, both in code and mental capacity of our users, of adding new overlapping dtypes and solutions, and you're not wrong about that. I think it behooves us to consider if there are solutions that solve multiple related problems at once instead of adding new dtypes piecemeal to solve individual problems.

> >> Having a new dtype changes nothing here.
You still need to create numpy > >> arrays from python strings which are well defined and clean. > >> If you put something in that doesn't encode you get an encoding error. > >> No oddities like surrogate escapes are needed, numpy arrays are not > >> interfaces to operating systems nor does numpy need to _add_ support for > >> historical oddities beyond what it already has. > >> If you want to represent bytes exactly as they came in don't use a text > >> dtype (which includes the S dtype, use i1). > > > > Thomas Aldcroft has demonstrated the problem with this approach. numpy > > arrays are often interfaces to files that have tons of historical oddities. > > This does not matter for numpy, the text dtype is well defined as bytes > with a specific encoding and null padding.

You cannot dismiss something as "not mattering for *numpy*" just because your new, *proposed* text dtype doesn't support it. You seem to have fixed on a course of action and are defining everyone else's use cases as out-of-scope because your course of action doesn't support them. That's backwards. Define the use cases first, determine the requirements, then build a solution that meets those requirements. We skipped those steps before, and that's why we're all feeling the pain.

> If you have an historical > oddity that does not fit, do not use the text dtype but use a pure byte > array instead.

That's his status quo, and he finds it unworkable. Now, I have proposed a way out of that by supporting ASCII-surrogateescape as a specific encoding. It's not an ISO standard encoding, but the surrogateescape mechanism seems to be what the Python world has settled on for such situations. Would you support that with your parameterized-encoding text dtype?

> >> Concerning variable sized strings, this is simply not going to happen. > >> Nobody is going to rewrite numpy to support it, especially not just for > >> something as unimportant as strings.
> >> Best you are going to get (or better already have) is object arrays. It > >> makes no sense to discuss it unless someone comes up with an actual > >> proposal and the willingness to code it. > > > > No one has suggested such a thing. At most, we've talked about > > specializing object arrays. > > > >> What is a relevant discussion is whether we really need a more compact > >> but limited representation of text than 4-byte utf32 at all. > >> Its usecase is for the most part just for python3 porting and saving > >> some memory in some ascii heavy cases, e.g. astronomy. > >> It is not that significant anymore as porting to python3 has mostly > >> already happened via the ugly byte workaround and memory saving is > >> probably not as significant in the context of numpy which is already > >> heavy on memory usage. > >> > >> My initial approach was to not add a new dtype but to make unicode > >> parametrizable which would have meant almost no cluttering of numpys > >> internals and keeping the api more or less consistent which would make > >> this a relatively simple addition of minor functionality for people that > >> want it. > >> But adding a completely new partially redundant dtype for this usecase > >> may be a too large change to the api. Having two partially redundant > >> string types may confuse users more than our current status quo of our > >> single string type (U). > >> > >> Discussing whether we want to support truncated utf8 has some merit as > >> it is a decision whether to give the users an even larger gun to shoot > >> themselves in the foot with. > >> But I'd like to focus first on the 1 byte type to add a symmetric API > >> for python2 and python3. > >> utf8 can always be added later should we deem it a good idea. > > > > What is your current proposal? A string dtype parameterized with the > > encoding (initially supporting the latin-1 that you desire and maybe > > adding utf-8 later)?
Or a latin-1-specific dtype such that we will have > > to add a second utf-8 dtype at a later date? > > My proposal is a single new parameterizable dtype. Adding multiple > dtypes for each encoding seems unnecessary to me given that numpy > already supports parameterizable types. > For example datetime is very similar, it is basically encoded integers. > There are multiple encodings = units supported. Okay great. What encodings are you intending to support? You seem to be pushing against supporting UTF-8. > > If you're not going to support arbitrary encodings right off the bat, > > I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first > > as they seem to knock off more use cases straight away. > > Please list the use cases in the context of numpy usage. hdf5 is the > most obvious, but how exactly would hdf5 use an utf8 array in the actual > implementation? File reading: The user requests data from a fixed-width UTF-8 Dataset. E.g. h5py: >>> a = h5['/some_utf8_array'][:] h5py looks at the Dataset's shape (with the fixed width defined in bytes) and allocates a numpy UTF-8 array with the dtype being given the same bytewidth as specified by the Dataset. h5py fills in the data quickly in bulk using libhdf5's efficient APIs for such data movement. The user now has a numpy array whose scalars come out/go in as `unicode/str` objects. File writing: The user needs to create a string Dataset with Unicode characters. A fixed-width UTF-8 Dataset is preferred (in this case) over HDF5 variable-width Datasets because the latter is not compressible, and the strings are all reasonably close in size. The user's in-memory data may or may not be in a UTF-8 array (it might be in an object array of `unicode/str` string objects or a U-dtype array), but h5py can use numpy's conversion machinery to turn it into a numpy UTF-8 array (much like it can accept lists of floats and cast it to a float64 array). 
It can look at the UTF-8 array's shape and itemsize to create the corresponding Dataset, and then pass the array to libhdf5's efficient APIs for copying arrays of data into a Dataset. > What you save by having utf8 in the numpy array is replacing a decoding > and encoding step with a stripping null padding step. > That doesn't seem very worthwhile compared to all their other overheads > involved. It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Apr 26 15:17:15 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 12:17:15 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <1493231889.17161.3.camel@sipsolutions.net> References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. (I think she got around it, by weird > trickery, inverting the endianess or so and thus putting the zero bytes > first). > Maybe will ask her if this discussion is interesting to her.
Though I > think it might have been something like "make everything in > hdf5/something similar work" without any actual use case, I don't know. I don't think that will be an issue for an encoding-parameterized dtype. The decoding machinery of that would have access to the full-width buffer for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes for UTF-16). It's only if you have to hack around at a higher level with numpy's S arrays, which return Python byte strings that strip off the trailing NULL bytes, that you have to worry about such things. Getting a Python scalar from the numpy S array loses information in such cases. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Apr 26 18:27:10 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 15:27:10 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith wrote: > UTF-8 does not match the character-oriented Python text model. Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the model python 3 has chosen. I wrote a much > longer rant about that earlier. > > So I think the easy to access, and particularly defaults, numpy string > dtypes should match it. > > > This seems a little vague? > sorry -- that's what I get for trying to be concise... > The "character-oriented Python text model" is just that str supports O(1) > indexing of characters.
> not really -- I think the performance characteristics are an implementation detail (though it did influence the design, I'm sure) I'm referring to the fact that a python string appears (to the user -- also under the hood, but again, implementation detail) to be a sequence of characters, not a sequence of bytes, not a sequence of glyphs, or graphemes, or anything else. Every Python string has a length, and that length is the number of characters, and if you index you get a string of length-1, and it has one character in it, and that character corresponds to a single code point. Someone could implement a python string using utf-8 under the hood, and none of that would change (and I think micropython may have done that...) Sure, you might get two characters when you really expect a single grapheme, but it's at least a consistent oddity. (well, not always, as some graphemes can be represented by either a single code point or two combined -- human language really sucks!) The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point that a character-oriented interface is not the only one that makes sense, and may not make sense at all. However: 1) Python has chosen that interface 2) It is a good interface (probably the best for computer use) if you need to choose only one utf8everywhere is mostly arguing for utf-8 over utf16 -- and secondarily for utf-8 everywhere as the best option for working at the C level. That's probably true. (I also think the utf-8 fans are in a bit of a fantasy world -- this would all be easier, yes, if one encoding was used for everything, all the time, but other than that, utf-8 is not a panacea -- we are still going to have encoding headaches no matter how you slice it) So where does numpy fit? Well, it does operate at the C level, but people work with it from python, so exposing the details of the encoding to the user should be strictly opt-in.
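To make that contrast concrete, here is a quick sketch of the character-oriented model versus byte counts (plain Python, no numpy needed):

```python
# Python's str is a sequence of code points: len() and indexing are
# character-oriented, no matter how many bytes each character needs
# in any particular encoding.
s = "naïve café"  # 10 characters; two of them need 2 bytes in UTF-8

print(len(s))                      # 10 -- characters, not bytes
print(s[2])                        # 'ï' -- indexing yields a length-1 str
print(len(s.encode("utf-8")))      # 12 -- byte length depends on the encoding
print(len(s.encode("utf-32-le")))  # 40 -- 4 bytes/char, like numpy's U dtype
```

The same string has one character count but a different byte count per encoding, which is exactly the mismatch being discussed.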
When a numpy user wants to put a string into a numpy array, they should know how long a string they can fit -- with "length" defined the way python strings define it. Using utf-8 for the default string in numpy would be like using float16 for default float -- not a good idea! I believe Julian said there would be no default -- you would need to specify, but I think there does need to be one: np.array(["a string", "another string"]) needs to do something. If we make a parameterized dtype that accepts any encoding, then we could do: np.array(["a string", "another string"], dtype=np.stringtype["utf-8"]) If folks really want that. I'm afraid that that would lead to errors -- cool, utf-8 is just like ascii, but with full Unicode support! But... Numpy doesn't. If you want to access individual characters inside a > string inside an array, you have to pull out the scalar first, at which > point the data is copied and boxed into a Python object anyway, using > whatever representation the interpreter prefers. > > So AFAICT? it makes literally no difference to the user whether numpy's > internal representation allows for fast character access. > agreed - unless someone wants to do a view that makes an N-D array for strings look like a 1-D array of characters.... Which seems odd, but there was recently a big debate on the netcdf CF conventions list about that very issue... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed...
URL: From chris.barker at noaa.gov Wed Apr 26 18:44:03 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 15:44:03 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: <1493231889.17161.3.camel@sipsolutions.net> References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. I think it's really clear that you don't want to mess with the bytes in any way without knowing the encoding -- for UTF-16, the code unit is two bytes, so a "null" is two zero bytes in a row. So generic "null padded" or "null terminated" is dangerous -- it would have to be "Null-padded utf-8" or whatever. Though I > think it might have been something like "make everything in > hdf5/something similar work" That would be nice :-), but I suspect HDF-5 is the same as everything else -- there are files in the wild where someone jammed the wrong thing into a text array .... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Wed Apr 26 19:21:26 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 16:21:26 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern wrote: > >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many cases. > isn't UTF-32 pretty compressible also? lots of zeros in there.... here's an example with pure ascii Lorem Ipsum text: In [17]: len(text) Out[17]: 446 In [18]: len(utf8) Out[18]: 446 # the same -- it's pure ascii In [20]: len(utf32) Out[20]: 1788 # four times as big -- of course. In [22]: len(bz2.compress(utf8)) Out[22]: 302 # so from 446 to 302, not that great -- probably it would be better for longer text # -- but are we compressing whole arrays or individual strings? In [23]: len(bz2.compress(utf32)) Out[23]: 319 # almost as good as the compressed utf-8 And I'm guessing it would be even closer with more non-ascii characters. OK -- turns out I'm wrong -- here it is with greek -- not a lot of ascii characters: In [29]: len(text) Out[29]: 672 In [30]: utf8 = text.encode("utf-8") In [31]: len(utf8) Out[31]: 1180 # not bad, really -- still smaller than utf-16 :-) In [33]: len(bz2.compress(utf8)) Out[33]: 495 # pretty good then -- better than 50% In [34]: utf32 = text.encode("utf-32") In [35]: len(utf32) Out[35]: 2692 In [36]: len(bz2.compress(utf32)) Out[36]: 515 # still not quite as good as utf-8, but close. So: utf-8 compresses better than utf-32, but only by a little bit -- at least with bz2. But it is a lot smaller uncompressed. >>> The major use case that we have for a UTF-8 array is HDF5, and it > specifies the width in bytes, not Unicode characters.
> >> > >> It's not just HDF5. Counting bytes is the Right Way to measure the size > of UTF-8 encoded text: > >> http://utf8everywhere.org/#myths > It's really the only way with utf-8 -- which is why it is an impedance mismatch with python strings. >> I also firmly believe (though clearly this is not universally agreed > upon) that UTF-8 is the Right Way to encode strings for *non-legacy* > applications. > fortunately, we don't need to agree to that to agree that: > So if we're adding any new string encodings, it needs to be one of them. > Yup -- the most important one to add -- I don't think it is "The Right Way" for all applications -- but it is "The Right Way" for text interchange. And regardless of what any of us think -- it is widely used. > (1) object arrays of strings. (We have these already; whether a > strings-only specialization would permit useful things like string-oriented > ufuncs is a question for someone who's willing to implement one.) > This is the right way to get variable length strings -- but I'm concerned that it doesn't mesh well with numpy uses like npz files, raw dumping of array data, etc. It should not be the only way to get proper Unicode support, nor the default when you do: array(["this", "that"]) > > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. > All python encodings should be permitted. An additional function to > truncate encoded data without mangling the encoding would be handy. > I think necessary -- at least when you pass in a python string... > I think it makes more sense for this to be NULL-padded than > NULL-terminated but it may be necessary to support both; note that > NULL-termination is complicated for encodings like UCS4. > is it if you know it's UCS4? or even know the size of the code-unit (I think that's the term) > This also includes the legacy UCS4 strings as a special case. > what's special about them? I think the only thing should be that they are the default.
> > > (3) a dtype for fixed-length byte strings. This doesn't look very > different from an array of dtype u8, but given we have the bytes type, > accessing the data this way makes sense. > > The void dtype is already there for this general purpose and mostly works, > with a few niggles. > I'd never noticed that! And if I had I never would have guessed I could use it that way. > If it worked more transparently and perhaps rigorously with `bytes`, then > it would be quite suitable. > Then we should fix a bit of those things -- and call it something like "bytes", please. -CHB > > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Apr 26 19:30:04 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 26 Apr 2017 16:30:04 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker wrote: > When a numpy user wants to put a string into a numpy array, they should > know how long a string they can fit -- with "length" defined how python > strings define it. > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and myself have already given), but we seem to be talking past each other here. I am still -1 on any new string encoding support unless that includes at least UTF-8, with length indicated by the number of bytes. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From njs at pobox.com Wed Apr 26 19:49:29 2017 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 26 Apr 2017 16:49:29 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.debian at googlemail.com> wrote: [...] > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it. Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683 I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, I can assure you that I have memory mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support". Since concrete examples are often helpful in focusing discussions, here's some code for reading a lab-internal EEG file format: https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py See in particular _header_dtype with its embedded string fields, and the code in _channel_names_from_header -- both of these really benefit from having a quick and easy way to talk about fixed width strings of single byte characters.
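For readers without that file at hand, a toy header in the same spirit may help -- a structured dtype with fixed-width byte-string fields embedded in a binary record. The field names and sizes here are invented for illustration, not the actual erpss layout:

```python
import numpy as np

# A structured dtype mixing fixed-width byte-string fields with numeric
# fields, read straight out of a binary blob -- the pattern _header_dtype
# uses (field names/sizes here are made up, not the real format).
header_dtype = np.dtype([
    ("magic",     "S4"),    # 4-byte format tag
    ("subject",   "S8"),    # NUL-padded subject id
    ("nchannels", "<u2"),   # little-endian channel count
])

raw = b"EEG0" + b"s042" + b"\x00" * 4 + (32).to_bytes(2, "little")
hdr = np.frombuffer(raw, dtype=header_dtype)[0]

print(hdr["magic"])            # b'EEG0'
print(hdr["subject"])          # b's042' -- trailing NULs stripped on access
print(int(hdr["nchannels"]))   # 32
```

This is essentially the "read sizeof(struct header) and cast" idiom described next, done from Python; note how the S scalar silently drops the NUL padding on access, which is the information-loss issue raised earlier in the thread.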
(The history here of course is that the original tools for reading/writing this format are written in C, and they just read in sizeof(struct header) and cast to the header.) _get_full_string in that file is also interesting: it's a nasty hack I implemented because in some cases I actually needed *fixed width* strings, not NUL padded ones, and didn't know a better way to do it. (Yes, there's void, but I have no idea how those work. They're somehow related to buffer objects, whatever those are?) In other cases though that file really does want NUL padding. Of course that file is python 2 and blissfully ignorant of unicode. Thinking about what we'd want if porting to py3: For the "pull out this fixed width chunk of the file" problem (what _get_full_string does) then I definitely don't care about unicode; this isn't text. np.void or an array of np.uint8 aren't actually too terrible I suspect, but it'd be nice if there were a fixed-width dtype where indexing gave back a native bytes or bytearray object, or something similar like np.bytes_. For the arrays of single-byte-encoded-NUL-padded text, then the fundamental problem is just to convert between a chunk of bytes in that format and something that numpy can handle. One way to do that would be with a dtype that represented ascii-encoded-fixed-width-NUL-padded text, or any ascii-compatible encoding. But honestly I'd be just as happy with np.encode/np.decode ufuncs that converted between the existing S dtype and any kind of text array; the existing U dtype would be fine given that. The other thing that might be annoying in practice is that when writing py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on py3", but there's no dtype with similar behavior. Maybe there's no good solution and this just needs a few version-dependent convenience functions stuck in a private utility library, dunno.
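Worth noting: numpy already ships elementwise codec conversions that come close to the np.encode/np.decode ufuncs suggested above -- np.char.decode and np.char.encode -- though they build a new array rather than reinterpreting in place:

```python
import numpy as np

# np.char.decode / np.char.encode convert elementwise between S (bytes)
# and U (text) arrays with any Python codec -- close in spirit to the
# np.encode/np.decode ufuncs suggested above.
s = np.array([b"spam", b"egg\xe9"], dtype="S4")  # latin-1 bytes on disk

u = np.char.decode(s, "latin-1")     # S -> U, one decode per element
print(u)        # ['spam' 'eggé']

back = np.char.encode(u, "latin-1")  # U -> S round trip
print(back[1])  # b'egg\xe9'
```

These allocate a full UCS-4 result, so they don't address the memory argument, but they do cover the "get NUL-padded single-byte text into something numpy can handle" conversion step.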
> What you save by having utf8 in the numpy array is replacing a decoding > and encoding step with a stripping null padding step. > That doesn't seem very worthwhile compared to all their other overheads > involved. It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it? My impression is similar to Julian's: you *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code, which is nothing compared to all the other hoops these libraries are already jumping through, so if this is really the roadblock then I must be missing something. -n -------------- next part -------------- An HTML attachment was scrubbed...
URL: From chris.barker at noaa.gov Wed Apr 26 20:02:12 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 17:02:12 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer wrote: > > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and > myself have already given), but we seem to be talking past each other here. > yeah -- I think it's not clear what the use cases we are talking about are. > I am still -1 on any new string encoding support unless that includes at > least UTF-8, with length indicated by the number of bytes. > I've said multiple times that utf-8 support is key to any "exchange binary data" use case (memory mapping?) -- so yes, absolutely. I _think_ this may be some of the source for the confusion: The name of this thread is: "proposal: smaller representation of string arrays". And I got the impression, maybe mistaken, that folks were suggesting that internally encoding strings in numpy as "UTF-8, with length indicated by the number of bytes." was THE solution to the "the 'U' dtype takes up way too much memory, particularly for mostly-ascii data" problem. I do not think it is a good solution to that problem. I think a good solution to that problem is latin-1 encoding. (bear with me here...) But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. For THAT -- utf-8 is critical. But if I understand Julian's proposal -- he wants to create a parameterized text dtype that you can set the encoding on, and then numpy will use the encoding (and python's machinery) to encode / decode when passing to/from python strings.
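What that encode / decode step would amount to can be sketched with plain Python codecs (pure illustration -- no such numpy dtype exists; the helper names are invented):

```python
# Sketch of what a parameterized text dtype's scalar machinery might do:
# encode into a fixed-width NUL-padded field, and strip the padding only
# *after* decoding, so multi-byte code units (UTF-16/32) survive intact.
# Helper names are invented; assumes the field width is a multiple of the
# encoding's code-unit size.
def pack(text, nbytes, encoding):
    raw = text.encode(encoding)
    if len(raw) > nbytes:
        raise ValueError("encoded text needs %d bytes, field has %d"
                         % (len(raw), nbytes))
    return raw + b"\x00" * (nbytes - len(raw))

def unpack(raw, encoding):
    # Decode first, then strip trailing NUL *characters*; stripping raw
    # zero bytes would corrupt UTF-16/UTF-32 data.
    return raw.decode(encoding).rstrip("\x00")

field = pack("café", 8, "utf-8")
print(field)                   # b'caf\xc3\xa9\x00\x00\x00'
print(unpack(field, "utf-8"))  # café
print(unpack(pack("abc", 12, "utf-16-le"), "utf-16-le"))  # abc
```

The same two helpers work for any codec, which is the appeal of parameterizing the dtype on the encoding rather than hard-wiring one.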
It seems this would support all our desires: I'd get a latin-1 encoded type for compact representation of mostly-ascii data. Thomas would get latin-1 for binary interchange with mostly-ascii data. The HDF-5 folks would get utf-8 for binary interchange (if we can work out the null-padding issue). Even folks that had weird JAVA or Windows-generated UTF-16 data files could do the binary interchange thing.... I'm now lost as to what the hang-up is. -CHB PS: null padding is a pain, python strings seem to preserve the zeros, which is odd -- is there a unicode code-point at \x00? But you can use it to strip properly with the unicode sandwich: In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00' In [64]: ut16.decode('utf-16') Out[64]: 'some text\x00\x00\x00' In [65]: ut16.decode('utf-16').strip('\x00') Out[65]: 'some text' In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16') Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00' -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Apr 26 20:08:30 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 17:08:30 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > > On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: >> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years.
The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. > > I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it? https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/379 -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Apr 26 20:17:29 2017 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 26 Apr 2017 17:17:29 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker wrote: > But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: > > * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. > > For THAT -- utf-8 is critical.
> > But if I understand Julian's proposal -- he wants to create a parameterized text dtype that you can set the encoding on, and then numpy will use the encoding (and python's machinery) to encode / decode when passing to/from python strings. > > It seems this would support all our desires: > > I'd get a latin-1 encoded type for compact representation of mostly-ascii data. > > Thomas would get latin-1 for binary interchange with mostly-ascii data > > The HDF-5 folks would get utf-8 for binary interchange (If we can workout the null-padding issue) > > Even folks that had weird JAVA or Windows-generated UTF-16 data files could do the binary interchange thing.... > > I'm now lost as to what the hang-up is. The proposal is for only latin-1 and UTF-32 to be supported at first, and the eventual support of UTF-8 will be constrained by specification of the width in terms of characters rather than bytes, which conflicts with the use cases of UTF-8 that have been brought forth. https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Apr 26 20:50:22 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 26 Apr 2017 17:50:22 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <8741041756854148453@unknownmsgid> <1229716955908306730@unknownmsgid> <-5378706506035339722@unknownmsgid> Message-ID: On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern wrote: > The proposal is for only latin-1 and UTF-32 to be supported at first, and > the eventual support of UTF-8 will be constrained by specification of the > width in terms of characters rather than bytes, which conflicts with the > use cases of UTF-8 that have been brought forth. > > https://mail.python.org/pipermail/numpy-discussion/ > 2017-April/076668.html > thanks -- I had forgotten (clearly) it was that limited. 
But my question now is -- if there is an encoding-parameterized string dtype, then is it much more effort to have it support all the encodings in the stdlib? It seems that would solve everyone's issue. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Apr 26 21:34:41 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 26 Apr 2017 18:34:41 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > It's worthwhile enough that both major HDF5 bindings don't support Unicode > arrays, despite user requests for years. The sticking point seems to be the > difference between HDF5's view of a Unicode string array (defined in size > by the bytes of UTF-8 data) and numpy's current view of a Unicode string > array (because of UCS-4, defined by the number of > characters/codepoints/whatever). So there are HDF5 files out there that > none of our HDF5 bindings can read, and it is impossible to write certain > data efficiently. > > > I would really like to hear more from the authors of these libraries about > what exactly it is they feel they're missing. Is it that they want numpy to > enforce the length limit early, to catch errors when the array is modified > instead of when they go to write it to the file? Is it that they really > want an O(1) way to look at an array and know the maximum number of > bytes needed to represent it in utf-8?
Is it that utf8<->utf-32 conversion is > really annoying and files that need it are rare so they haven't had the > motivation to implement it? My impression is similar to Julian's: you > *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few > dozen lines of code, which is nothing compared to all the other hoops these > libraries are already jumping through, so if this is really the roadblock > then I must be missing something. > I actually agree with you. I think it's mostly a matter of convenience that h5py matched up HDF5 dtypes with numpy dtypes:

fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and there's not an easy fix. We absolutely could fix h5py by mapping everything to object arrays of Python unicode strings, as has been discussed (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would be a fine but non-ideal solution, since there is currently no fixed width UTF-8 support. For fixed width ASCII arrays, this would mean increased convenience for Python 3 users, at the price of decreased convenience for Python 2 users (arrays now contain boxed Python objects), unless we made the h5py behavior dependent on the version of Python. Hence, we're back here, waiting for better dtypes for encoded strings. So for HDF5, I see good use cases for ASCII-with-surrogateescape (for handling ASCII arrays as strings) and UTF-8 with length equal to the number of bytes. -------------- next part -------------- An HTML attachment was scrubbed...
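The Python 3 pain point behind that h5py dtype mapping is easy to reproduce; a minimal sketch (the array values are invented for the example):

```python
import numpy as np

# Fixed-width HDF5 ASCII data maps to numpy's 'S' dtype, whose
# scalars are bytes on Python 3, not str.
a = np.array([b"alpha", b"beta"], dtype="S5")

assert a.dtype.itemsize == 5            # one byte per character
assert a[0] == b"alpha"                 # comparing against bytes works
assert a[0] != "alpha"                  # ...but str never equals bytes on Python 3
assert a[0].decode("ascii") == "alpha"  # the boundary decode h5py would have to do
```

This is the "decreased convenience" trade-off: either users decode by hand, or the library boxes everything into object arrays of str.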
URL: From faltet at gmail.com Thu Apr 27 07:10:42 2017 From: faltet at gmail.com (Francesc Alted) Date: Thu, 27 Apr 2017 13:10:42 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : > On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > >> It's worthwhile enough that both major HDF5 bindings don't support >> Unicode arrays, despite user requests for years. The sticking point seems >> to be the difference between HDF5's view of a Unicode string array (defined >> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode >> string array (because of UCS-4, defined by the number of >> characters/codepoints/whatever). So there are HDF5 files out there that >> none of our HDF5 bindings can read, and it is impossible to write certain >> data efficiently. >> >> >> I would really like to hear more from the authors of these libraries >> about what exactly it is they feel they're missing. Is it that they want >> numpy to enforce the length limit early, to catch errors when the array is >> modified instead of when they go to write it to the file? Is it that they >> really want an O(1) way to look at a array and know the maximum number of >> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion >> is really annoying and files that need it are rare so they haven't had the >> motivation to implement it? My impression is similar to Julian's: you >> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few >> dozen lines of code, which is nothing compared to all the other hoops these >> libraries are already jumping through, so if this is really the roadblock >> then I must be missing something. >> > > I actually agree with you. 
I think it's mostly a matter of convenience > that h5py matched up HDF5 dtypes with numpy dtypes: > fixed width ASCII -> np.string_/bytes > variable length ASCII -> object arrays of np.string_/bytes > variable length UTF-8 -> object arrays of unicode > > This was tenable in a Python 2 world, but on Python 3 it's broken and > there's not an easy fix. > > We absolutely could fix h5py by mapping everything to object arrays of > Python unicode strings, as has been discussed ( > https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would > be a fine but non-ideal solution, since there is currently no fixed width > UTF-8 support. > > For fixed width ASCII arrays, this would mean increased convenience for > Python 3 users, at the price of decreased convenience for Python 2 users > (arrays now contain boxed Python objects), unless we made the h5py behavior > dependent on the version of Python. Hence, we're back here, waiting for > better dtypes for encoded strings. > > So for HDF5, I see good use cases for ASCII-with-surrogateescape (for > handling ASCII arrays as strings) and UTF-8 with length equal to the number > of bytes. > Well, I'll say upfront that I have not read this discussion in full, but apparently some opinions from developers of HDF5 Python packages would be welcome here, so here I go :) As a long-time developer of one of the Python HDF5 packages (PyTables), I have always been of the opinion that plain ASCII (for byte strings) and UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing large amounts of data, especially for disk storage (but also using compressed in-memory containers). My rationale is that, although UCS-4 may require way too much space, compression would reduce that to basically the space that is required by compressed UTF-8 (I won't go into detail, but basically this is possible by using the shuffle filter).
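The shuffle-then-compress idea can be sketched with plain numpy and zlib (a toy stand-in for HDF5's shuffle + deflate filters, not Blosc itself; the word list is invented for the example):

```python
import zlib
import numpy as np

words = np.array(["numpy", "scipy", "pandas"] * 1000, dtype="U6")  # UCS-4: 24 bytes/element
raw = words.tobytes()

# Shuffle filter: regroup the 1st, 2nd, 3rd and 4th bytes of every code
# point so the (almost always zero) high bytes form long runs that
# deflate can squeeze out.
shuffled = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 4).T.copy().tobytes()

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))

# The shuffle is lossless: transposing back recovers the original bytes.
restored = np.frombuffer(shuffled, dtype=np.uint8).reshape(4, -1).T.copy().tobytes()
assert restored == raw
assert shuffled_size < len(raw)  # far below the raw 24 bytes/element
```

For mostly-ASCII data, three of every four bytes per code point are zero, which is why the compressed UCS-4 stream ends up in the same ballpark as compressed UTF-8.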
I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is any going back (not even adding UCS-4 support on it, although I continue to think it would be a good idea). So, I suppose that if HDF5 is found to be an important format for NumPy users (and I think this is the case), a solution for representing Unicode characters by using UTF-8 in NumPy would be desirable (at the risk of making the implementation more complex). Francesc > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndbecker2 at gmail.com Thu Apr 27 07:27:31 2017 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 27 Apr 2017 11:27:31 +0000 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: So while compression+ucs-4 might be OK for out-of-core representation, what about in-core? blosc+ucs-4? I don't think that works for mmap, does it? On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote: > 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : > >> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: >> >>> It's worthwhile enough that both major HDF5 bindings don't support >>> Unicode arrays, despite user requests for years. The sticking point seems >>> to be the difference between HDF5's view of a Unicode string array (defined >>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode >>> string array (because of UCS-4, defined by the number of >>> characters/codepoints/whatever).
So there are HDF5 files out there that >>> none of our HDF5 bindings can read, and it is impossible to write certain >>> data efficiently. >>> >>> >>> I would really like to hear more from the authors of these libraries >>> about what exactly it is they feel they're missing. Is it that they want >>> numpy to enforce the length limit early, to catch errors when the array is >>> modified instead of when they go to write it to the file? Is it that they >>> really want an O(1) way to look at a array and know the maximum number of >>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion >>> is really annoying and files that need it are rare so they haven't had the >>> motivation to implement it? My impression is similar to Julian's: you >>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few >>> dozen lines of code, which is nothing compared to all the other hoops these >>> libraries are already jumping through, so if this is really the roadblock >>> then I must be missing something. >>> >> >> I actually agree with you. I think it's mostly a matter of convenience >> that h5py matched up HDF5 dtypes with numpy dtypes: >> fixed width ASCII -> np.string_/bytes >> variable length ASCII -> object arrays of np.string_/bytes >> variable length UTF-8 -> object arrays of unicode >> >> This was tenable in a Python 2 world, but on Python 3 it's broken and >> there's not an easy fix. >> >> We absolutely could fix h5py by mapping everything to object arrays of >> Python unicode strings, as has been discussed ( >> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this >> would be a fine but non-ideal solution, since there is currently no fixed >> width UTF-8 support. 
>> >> For fixed width ASCII arrays, this would mean increased convenience for >> Python 3 users, at the price of decreased convenience for Python 2 users >> (arrays now contain boxed Python objects), unless we made the h5py behavior >> dependent on the version of Python. Hence, we're back here, waiting for >> better dtypes for encoded strings. >> >> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for >> handling ASCII arrays as strings) and UTF-8 with length equal to the number >> of bytes. >> > > Well, I'll say upfront that I have not read this discussion in the fully, > but apparently some opinions from developers of HDF5 Python packages would > be welcome here, so here I go :) ? > > As a long-time developer of one of the Python HDF5 packages (PyTables), I > have always been of the opinion that plain ASCII (for byte strings) and > UCS-4 (for Unicode) encoding would be the appropriate dtypes? for storing > large amounts of data, most specially for disk storage (but also using > compressed in-memory containers). My rational is that, although UCS-4 may > require way too much space, compression would reduce that to basically the > space that is required by compressed UTF-8 (I won't go into detail, but > basically this is possible by using the shuffle filter). > > I remember advocating for UCS-4 adoption in the HDF5 library many years > ago (2007?), but I had no success and UTF-8 was decided to be the best > candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I > don't think there is a go back (not even adding UCS-4 support on it, > although I continue to think it would be a good idea). So, I suppose that > if HDF5 is found to be an important format for NumPy users (and I think > this is the case), a solution for representing Unicode characters by using > UTF-8 in NumPy would be desirable (at the risk of making the implementation > more complex). > > ?Francesc > ? 
> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > > > -- > Francesc Alted > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faltet at gmail.com Thu Apr 27 07:38:06 2017 From: faltet at gmail.com (Francesc Alted) Date: Thu, 27 Apr 2017 13:38:06 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: 2017-04-27 13:27 GMT+02:00 Neal Becker : > So while compression+ucs-4 might be OK for out-of-core representation, > what about in-core? blosc+ucs-4? I don't think that works for mmap, does > it? > ?Correct, the real problem is mmap for an out-of-core, HDF5 representation, I presume. For in-memory, there are several compressed data containers, like: https://github.com/alimanfoo/zarr (meant mainly for multidimensional data containers) https://github.com/Blosc/bcolz? (meant mainly for tabular data containers) ?(there might be others).? > > On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote: > >> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : >> >>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: >>> >>>> It's worthwhile enough that both major HDF5 bindings don't support >>>> Unicode arrays, despite user requests for years. The sticking point seems >>>> to be the difference between HDF5's view of a Unicode string array (defined >>>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode >>>> string array (because of UCS-4, defined by the number of >>>> characters/codepoints/whatever). 
So there are HDF5 files out there >>>> that none of our HDF5 bindings can read, and it is impossible to write >>>> certain data efficiently. >>>> >>>> >>>> I would really like to hear more from the authors of these libraries >>>> about what exactly it is they feel they're missing. Is it that they want >>>> numpy to enforce the length limit early, to catch errors when the array is >>>> modified instead of when they go to write it to the file? Is it that they >>>> really want an O(1) way to look at a array and know the maximum number of >>>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion >>>> is really annoying and files that need it are rare so they haven't had the >>>> motivation to implement it? My impression is similar to Julian's: you >>>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few >>>> dozen lines of code, which is nothing compared to all the other hoops these >>>> libraries are already jumping through, so if this is really the roadblock >>>> then I must be missing something. >>>> >>> >>> I actually agree with you. I think it's mostly a matter of convenience >>> that h5py matched up HDF5 dtypes with numpy dtypes: >>> fixed width ASCII -> np.string_/bytes >>> variable length ASCII -> object arrays of np.string_/bytes >>> variable length UTF-8 -> object arrays of unicode >>> >>> This was tenable in a Python 2 world, but on Python 3 it's broken and >>> there's not an easy fix. >>> >>> We absolutely could fix h5py by mapping everything to object arrays of >>> Python unicode strings, as has been discussed ( >>> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this >>> would be a fine but non-ideal solution, since there is currently no fixed >>> width UTF-8 support. 
>>> >>> For fixed width ASCII arrays, this would mean increased convenience for >>> Python 3 users, at the price of decreased convenience for Python 2 users >>> (arrays now contain boxed Python objects), unless we made the h5py behavior >>> dependent on the version of Python. Hence, we're back here, waiting for >>> better dtypes for encoded strings. >>> >>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for >>> handling ASCII arrays as strings) and UTF-8 with length equal to the number >>> of bytes. >>> >> >> Well, I'll say upfront that I have not read this discussion in the fully, >> but apparently some opinions from developers of HDF5 Python packages would >> be welcome here, so here I go :) ? >> >> As a long-time developer of one of the Python HDF5 packages (PyTables), I >> have always been of the opinion that plain ASCII (for byte strings) and >> UCS-4 (for Unicode) encoding would be the appropriate dtypes? for storing >> large amounts of data, most specially for disk storage (but also using >> compressed in-memory containers). My rational is that, although UCS-4 may >> require way too much space, compression would reduce that to basically the >> space that is required by compressed UTF-8 (I won't go into detail, but >> basically this is possible by using the shuffle filter). >> >> I remember advocating for UCS-4 adoption in the HDF5 library many years >> ago (2007?), but I had no success and UTF-8 was decided to be the best >> candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I >> don't think there is a go back (not even adding UCS-4 support on it, >> although I continue to think it would be a good idea). So, I suppose that >> if HDF5 is found to be an important format for NumPy users (and I think >> this is the case), a solution for representing Unicode characters by using >> UTF-8 in NumPy would be desirable (at the risk of making the implementation >> more complex). >> >> ?Francesc >> ? 
>> >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >>> >> >> >> -- >> Francesc Alted >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Apr 27 12:18:47 2017 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 27 Apr 2017 09:18:47 -0700 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote: > I remember advocating for UCS-4 adoption in the HDF5 library many years > ago (2007?), but I had no success and UTF-8 was decided to be the best > candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I > don't think there is a go back > This is the key point -- we can argue all we want about the best encoding for fixed-length unicode-supporting strings (I think numpy and HDF have very similar requirements), but that is not our decision to make -- many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly and easily and consistently. I have made many anti utf-8 points in this thread because while we need to deal with utf-8 for interplay with other systems, I am very sure that it is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. 
Nor is it the best encoding for a mostly-ascii compact in memory format. So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support one-byte encodings suitable for non-european languages, and maybe utf-16 for Java and Windows compatibility, and .... So that seems to point to "support as many encodings as possible". And python has the machinery to do so -- so why not? (I'm taking Julian's word for it that having a parameterized dtype would not have a major impact on current code.) If we go with a string dtype parameterized by encoding, then we can pick sensible defaults, and let users use what they know best fits their use-cases. As for python2 -- it is on the way out; I think we should keep the 'U' and 'S' dtypes as they are for backward compatibility and move forward with the new one(s) in a way that is optimized for py3. And it would map to a py2 Unicode type. The only catch I see in that is what to do with bytes -- we should have a numpy dtype that matches the bytes model -- fixed length bytes that map to python bytes objects (this is almost what the void type is, yes?). But then under py2, would a bytes object (py2 string) map to numpy 'S' or numpy bytes objects? @Francesc: one more question for you: How important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e. would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed...
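The "python has the machinery to do so" point can be made concrete with the stdlib codecs registry: one code path handles any named encoding. The helpers below are a hypothetical sketch of what a parameterized dtype might do at its assignment/getitem boundary, not NumPy API; note that NUL-padding as done here is only safe for encodings that never emit embedded zero bytes, so the utf-16/utf-32 padding question stands:

```python
import codecs

def pack(s, encoding, nbytes):
    """What a parameterized dtype might do on assignment: encode and
    NUL-pad to the fixed field width, refusing strings that don't fit."""
    raw = codecs.encode(s, encoding)
    if len(raw) > nbytes:
        raise ValueError("needs %d bytes, field holds %d" % (len(raw), nbytes))
    return raw.ljust(nbytes, b"\x00")

def unpack(raw, encoding):
    """What it might do on item access: strip the padding and decode."""
    return codecs.decode(raw.rstrip(b"\x00"), encoding)

# The same two functions cover every codec the stdlib knows about:
for enc in ("ascii", "latin-1", "utf-8", "shift_jis"):
    assert unpack(pack("numpy", enc, 16), enc) == "numpy"
```

Supporting "as many encodings as possible" is then mostly a matter of storing the codec name on the dtype instance and validating widths at assignment time.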
URL: From faltet at gmail.com Thu Apr 27 12:57:03 2017 From: faltet at gmail.com (Francesc Alted) Date: Thu, 27 Apr 2017 18:57:03 +0200 Subject: [Numpy-discussion] proposal: smaller representation of string arrays In-Reply-To: References: <1229716955908306730@unknownmsgid> <53eadf43-f79c-3960-4c6a-f9a1ddd21854@googlemail.com> <1493231889.17161.3.camel@sipsolutions.net> Message-ID: 2017-04-27 18:18 GMT+02:00 Chris Barker : > On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote: > >> I remember advocating for UCS-4 adoption in the HDF5 library many years >> ago (2007?), but I had no success and UTF-8 was decided to be the best >> candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I >> don't think there is a go back >> > > This is the key point -- we can argue all we want about the best encoding > for fixed-length unicode-supporting strings (I think numpy and HDF have > very similar requirements), but that is not our decision to make -- many > other systems have chosen utf-8, so it's a really good idea for numpy to be > able to deal with that cleanly and easily and consistently. > Agreed. But it would also be a good idea to spread the word that simple UCS4 encoding in combination with compression can be a perfectly good system for storing large amounts of unicode data too. > > I have made many anti utf-8 points in this thread because while we need to > deal with utf-8 for interplay with other systems, I am very sure that it is > not the best format for a default, naive-user-of-numpy unicode-supporting > dtype. Nor is it the best encoding for a mostly-ascii compact in memory > format. > I resonate a lot with this feeling too :) > > So I think numpy needs to support at least: > > utf-8 > latin-1 > UCS-4 > > And it maybe should support one-byte encoding suitable for non-european > languages, and maybe utf-16 for Java and Windows compatibility, and ....
> > So that seems to point to "support as many encodings as possible" And > python has the machinery to do so -- so why not? > > (I'm taking Julian's word for it that having a parameterized dtype would > not have a major impact on current code) > > If we go with a parameterized by encoding string dtype, then we can pick > sensible defaults, and let users use what they know best fits their > use-cases. > > As for python2 -- it is on the way out, I think we should keep the 'U' and > 'S' dtypes as they are for backward compatibility and move forward with the > new one(s) in a way that is optimized for py3. And it would map to a py2 > Unicode type. > > The only catch I see in that is what to do with bytes -- we should have a > numpy dtype that matches the bytes model -- fixed length bytes that map to > python bytes objects. (this is almost what the void type is yes?) but then > under py2, would a bytes object (py2 string) map to numpy 'S' or numpy > bytes objects?? > > @Francesc: -- one more question for you: > > How important is it for pytables to match the numpy storage to the hdf > storage byte for byte? i.e. would it be a killer if encoding / decoding > happened every time at the boundary? I'm guessing yes, as this would have > been solved long ago if not. > The PyTables team decided some time ago that it was a waste of time and resources to maintain its internal HDF5 interface, and that it would be better to switch to h5py for the low-level I/O communication with HDF5 (btw, we just received a small NumFOCUS grant to continue the ongoing work on this; thanks guys!). This means that PyTables will be basically agnostic about this sort of encoding issue, and that the important package to take into account for interfacing NumPy and HDF5 is just h5py. -- Francesc Alted -------------- next part -------------- An HTML attachment was scrubbed...
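The boundary conversion Chris asks about already exists today as an explicit, copying step via np.char; a quick illustration of the latin-1 space saving that motivates the whole discussion (the array contents are invented for the example):

```python
import numpy as np

u = np.array(["mostly ascii text", "fits in latin-1"])  # dtype '<U17': 4 bytes/char (UCS-4)
s = np.char.encode(u, "latin-1")                        # dtype 'S17': 1 byte/char

assert u.dtype.itemsize == 68   # 17 characters x 4 bytes
assert s.dtype.itemsize == 17   # one byte per character
assert np.char.decode(s, "latin-1")[0] == u[0]          # lossless round trip
```

A dtype that stored the latin-1 bytes directly would get the 4x size win without the per-access encode/decode copies shown here.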
URL: From opossumnano at gmail.com Fri Apr 28 05:29:38 2017 From: opossumnano at gmail.com (Tiziano Zito) Date: Fri, 28 Apr 2017 02:29:38 -0700 (PDT) Subject: [Numpy-discussion] [ANN] 10th Advanced Scientific Programming in Python in Nikiti, Greece, August 28–September 2, 2017 Message-ID: <59030b82.ea85df0a.38a89.24cc@mx.google.com>

10th Advanced Scientific Programming in Python
==============================================

a Summer School by the G-Node and the Municipality of Sithonia

Scientists spend more and more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only a few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices which are standard in the industry, but especially tailored to the needs of a programming scientist.

Lectures are devised to be interactive and to give the students enough time to acquire direct hands-on experience with the materials. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project: an entertaining computer game.

We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to become a standard tool for the programming scientist.

This school is targeted at Master or PhD students and Post-docs from all areas of science.
Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course. We are striving hard to get a pool of students which is international and gender-balanced.

You can apply online: https://python.g-node.org

Application deadline: 23:59 UTC, May 31, 2017. There will be no deadline extension, so be sure to apply on time ;-) Be sure to read the FAQ before applying.

Participation is for free, i.e. no fee is charged! Participants however should take care of travel, living, and accommodation expenses by themselves.

Date & Location
===============
August 28–September 2, 2017. Nikiti, Sithonia, Halkidiki, Greece

Program
=======
• Best Programming Practices
  – Best practices for scientific programming
  – Version control with git and how to contribute to open source projects with GitHub
  – Best practices in data visualization
• Software Carpentry
  – Test-driven development
  – Debugging with a debugger
  – Profiling code
• Scientific Tools for Python
  – Advanced NumPy
• Advanced Python
  – Decorators
  – Context managers
  – Generators
• The Quest for Speed
  – Writing parallel applications
  – Interfacing to C with Cython
  – Memory-bound problems and memory profiling
  – Data containers: storage and fast access to large data
• Practical Software Development
  – Group project

Preliminary Faculty
===================
• Francesc Alted, freelance consultant, author of Blosc, Castelló de la Plana, Spain
• Pietro Berkes, NAGRA Kudelski, Lausanne, Switzerland
• Zbigniew Jędrzejewski-Szmek, Krasnow Institute, George Mason University, Fairfax, VA USA
• Eilif Muller, Blue Brain Project, École Polytechnique Fédérale de Lausanne, Switzerland
• Juan Nunez-Iglesias, Victorian Life Sciences Computation Initiative, University of Melbourne, Australia
• Rike-Benjamin Schuppner, Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Germany
• Nicolas P. Rougier, Inria Bordeaux Sud-Ouest, Institute of Neurodegenerative Disease, University of Bordeaux, France
• Bartosz Teleńczuk, European Institute for Theoretical Neuroscience, CNRS, Paris, France
• Stéfan van der Walt, Berkeley Institute for Data Science, UC Berkeley, CA USA
• Nelle Varoquaux, Berkeley Institute for Data Science, UC Berkeley, CA USA
• Tiziano Zito, freelance consultant, Berlin, Germany

Organizers
==========
For the German Neuroinformatics Node of the INCF (G-Node) Germany:
• Tiziano Zito, freelance consultant, Berlin, Germany
• Zbigniew Jędrzejewski-Szmek, Krasnow Institute, George Mason University, Fairfax, USA
• Jakob Jordan, Institute of Neuroscience and Medicine (INM-6), Forschungszentrum Jülich GmbH, Germany
• Etienne Roesch, Centre for Integrative Neuroscience and Neurodynamics, University of Reading, UK

Website: https://python.g-node.org
Contact: python-info at g-node.org

From harrigan.matthew at gmail.com Fri Apr 28 12:53:54 2017 From: harrigan.matthew at gmail.com (Matthew Harrigan) Date: Fri, 28 Apr 2017 12:53:54 -0400 Subject: [Numpy-discussion] [NumPy-discussion] Wish List of Possible ufunc Enhancements Message-ID: Here is a link to a wish list of possible ufunc enhancements. I would like to know what the community thinks. Thank you, Matt Harrigan -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sat Apr 29 01:29:45 2017 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 28 Apr 2017 22:29:45 -0700 Subject: [Numpy-discussion] [NumPy-discussion] Wish List of Possible ufunc Enhancements In-Reply-To: References: Message-ID: On Fri, Apr 28, 2017 at 9:53 AM, Matthew Harrigan wrote: > Here is a link to a wish list of possible ufunc enhancements.
I would like > to know what the community thinks. It looks like a pretty good list of ideas worth thinking about as and when someone has time :-). I'm not sure what feedback you're looking for beyond that? Do you have a purpose in mind for this list? The main thing I'd add is: making it possible for ufunc core loops to access the dtype object. This is the main blocker on a *lot* of things, probably more so than anything else on that list, because it would allow ufunc operations to be defined for parametrized dtypes like the S and U dtypes, categorical data, etc. -n -- Nathaniel J. Smith -- https://vorpus.org
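The parametrized-dtype point is visible from Python already: for the S and U dtypes the length is carried on the dtype *instance*, which is exactly the information a ufunc inner loop, receiving only bare data pointers and strides, does not get to see. A small sketch of the observation (not of any proposed API):

```python
import numpy as np

# The length parameter lives on the dtype instance, not on the type:
assert np.dtype("U8").itemsize == 32       # 8 code points x 4 bytes (UCS-4)
assert np.dtype("U16").itemsize == 64
assert np.dtype("U8") != np.dtype("U16")   # distinct dtypes of the same kind
assert np.dtype("S8").itemsize == 8        # 'S' is parameterized the same way
```

Any future encoding-parameterized string dtype, categorical dtype, etc. would carry its parameters the same way, hence the need for loops that can inspect the dtype object.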