[Numpy-discussion] A one-byte string dtype?

Tue Jan 21 09:48:11 EST 2014

On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:

>
>
>
> On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>>
>>
>>
>> On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <
>> aldcroft at head.cfa.harvard.edu> wrote:
>>
>>>
>>>
>>>
>>> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <
>>> charlesr.harris at gmail.com> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <
>>>> charlesr.harris at gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>>
>>>>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>>>>> <charlesr.harris at gmail.com> wrote:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
>>>>>> > <oscar.j.benjamin at gmail.com> wrote:
>>>>>> >>
>>>>>> >>
>>>>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris"
>>>>>> >> <charlesr.harris at gmail.com> wrote:
>>>>>> >> >
>>>>>> >> > I think we may want something like PEP 393. The S datatype may be
>>>>>> >> > the wrong place to look, we might want a modification of U instead
>>>>>> >> > so as to transparently get the benefit of python strings.
>>>>>> >>
>>>>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than
>>>>>> >> it does for numpy arrays for two reasons: str is immutable and opaque.
>>>>>> >>
>>>>>> >> Since str is immutable the maximum code point in the string can be
>>>>>> >> determined once when the string is created before anything else can
>>>>>> >> get a pointer to the string buffer.
>>>>>> >>
>>>>>> >> Since it is opaque no one can rightly expect it to expose a particular
>>>>>> >> binary format so it is free to choose without compromising any
>>>>>> >> expected semantics.
>>>>>> >>
>>>>>> >> If someone can call buffer on an array then the FSR is a semantic
>>>>>> >> change.
>>>>>> >>
>>>>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII
>>>>>> >> characters then it would have a one byte per char buffer. What then
>>>>>> >> happens if you put a higher code point in? The buffer needs to be
>>>>>> >> resized and the data copied over. But then what happens to any buffer
>>>>>> >> objects or array views? They would be pointing at the old buffer from
>>>>>> >> before the resize. Subsequent modifications to the resized array would
>>>>>> >> not show up in other views and vice versa.
>>>>>> >>
>>>>>> >> I don't think that this can be done transparently since users of a
>>>>>> >> numpy array need to know about the binary representation. That's why I
>>>>>> >> suggest a dtype that has an encoding. Only in that way can it
>>>>>> >> consistently have both a binary and a text interface.
>>>>>> >
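
To make the view problem above concrete, here is a minimal sketch using only the existing fixed-width 'U' dtype. It assumes a NumPy recent enough to provide np.shares_memory, and it uses astype() merely as a stand-in for the hypothetical FSR resize; no such resize exists in numpy.

```python
import numpy as np

# A fixed-width 'U' array stores every character as 4 bytes (UCS-4),
# so the buffer layout never depends on the contents.
a = np.array(["abc", "de"], dtype="U3")
itemsize = a.dtype.itemsize            # 3 characters * 4 bytes each

# A slice is a view onto the same memory as the original array.
v = a[:1]
shares = np.shares_memory(a, v)

# Writing a non-ASCII code point needs no resize, and the view sees it.
a[0] = "\u03b1bc"
seen_through_view = str(v[0])

# Stand-in for an FSR-style resize: astype() allocates a fresh buffer.
# The old view keeps pointing at the old buffer and misses later writes.
b = a.astype("U6")
b[0] = "xyz"
still_shares = np.shares_memory(b, v)
stale_value = str(v[0])
```

With a fixed 4-byte layout the view stays valid; once the data moves to a new buffer, existing views silently go stale.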
>>>>>> >
>>>>>> > I didn't say we should change the S type, but that we should have
>>>>>> > something, say 's', that appeared to python as a string. I think if we
>>>>>> > want transparent string interoperability with python together with a
>>>>>> > compressed representation, and I think we need both, we are going to
>>>>>> > have to deal with the difficulties of utf-8. That means raising errors
>>>>>> > if the string doesn't fit in the allotted size, etc. Mind, this is a
>>>>>> > workaround for the mass of ascii data that is already out there, not a
>>>>>> > substitute for 'U'.
>>>>>>
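
A rough sketch of the "raise if it doesn't fit" behaviour described above, in plain Python. The helper name encode_fixed and the NUL padding are assumptions made for illustration only; this is not an existing numpy API.

```python
def encode_fixed(text, nbytes, encoding="utf-8"):
    """Encode text into exactly nbytes (NUL-padded); raise if it does not fit.

    Hypothetical helper for illustration, not part of numpy.
    """
    raw = text.encode(encoding)
    if len(raw) > nbytes:
        raise ValueError(
            "%r needs %d bytes in %s, field holds %d"
            % (text, len(raw), encoding, nbytes)
        )
    return raw.ljust(nbytes, b"\x00")


# Pure ASCII: one byte per character in utf-8.
ascii_field = encode_fixed("cafe", 4)

# The same four characters with an accent fit in latin-1 ...
latin1_field = encode_fixed("caf\u00e9", 4, encoding="latin-1")

# ... but not in utf-8, where the accented character takes two bytes.
try:
    encode_fixed("caf\u00e9", 4)
    overflowed = False
except ValueError:
    overflowed = True
```

The point is that with utf-8 the number of characters that fit in a fixed-size field depends on the contents, which is what forces the error path.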
>>>>>> If we're going to be taking that much trouble, I'd suggest going ahead
>>>>>> and adding a variable-length string type (where the array itself
>>>>>> contains a pointer to a lookaside buffer, maybe with an optimization
>>>>>> for stashing short strings directly). The fixed-length requirement is
>>>>>> pretty onerous for lots of applications (e.g., pandas always uses
>>>>>> dtype="O" for strings -- and that might be a good workaround for some
>>>>>> people in this thread for now). The use of a lookaside buffer would
>>>>>> also make it practical to resize the buffer when the maximum code
>>>>>> point changed, for that matter...
>>>>>>
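
For reference, the dtype="O" workaround mentioned above can be sketched as follows. This only illustrates current numpy behaviour, not the proposed variable-length string type.

```python
import numpy as np

# Object arrays hold references to ordinary Python str objects,
# so each element can have any length and any code points.
names = np.array(["a", "bb"], dtype=object)
names[0] = "much longer than before, with \u00e9"
kept_whole = names[0]

# A fixed-width 'U2' array silently truncates on assignment instead.
fixed = np.array(["a", "bb"], dtype="U2")
fixed[0] = "too long"
truncated = str(fixed[0])
```

The cost is that the array buffer holds only pointers, so the string data is neither contiguous nor directly memory-mappable.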
>>>>>
>>>> The more I think about it, the more I think we may need to do that.
>>>> Note that dynd has ragged arrays and I think they are implemented as
>>>> pointers to buffers. The easy way for us to do that would be a
>>>> specialization of object arrays to string types only as you suggest.
>>>>
>>>
>>> Is this approach intended to be in *addition to* the latin-1 "s" type
>>> originally proposed by Chris, or *instead of* that?
>>>
>>>
>> Well, that's open for discussion. The problem is to have something that
>> is both compact (latin-1) and interoperates transparently with python 3
>> strings (utf-8). A latin-1 type would be easier to implement and would
>> probably be a better choice for something available in both python 2 and
>> python 3, but unless the python 3 developers come up with something clever
>> I don't see how to make it behave transparently as a string in python 3.
>> OTOH, it's not clear to me how to make utf-8 operate transparently with
>> python 2 strings, especially as the unicode representation choices in
>> python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
>> is unlikely to be backported. The problem may be unsolvable in a completely
>> satisfactory way.
>>
>
> Since it's open for discussion, I'll put in my vote for implementing the
> easier latin-1 version in the short term to facilitate Python 2 / 3
> interoperability.  This would solve my use-case (giga-rows of short fixed
> length strings), and presumably allow things like memory mapping of large
> data files (like for FITS files in astropy.io.fits).
>
> I don't have a clue how the current 'U' dtype works under the hood, but
> from my user perspective it seems to work just fine in terms of interacting
> with Python 3 strings.  Is there a technical problem with doing basically
> the same thing for an 's' dtype, but using latin-1 instead of UCS-4?
>
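
A small sketch of the size trade-off being discussed, using only the existing 'U' and 'S' dtypes and np.char.encode / np.char.decode. The proposed latin-1 's' dtype does not exist, so 'S' holding latin-1 bytes stands in for it here.

```python
import numpy as np

u = np.array(["caf\u00e9", "abc"])        # 'U4': 4 bytes per character
s = np.char.encode(u, "latin-1")          # 'S4': 1 byte per character

u_itemsize = u.dtype.itemsize
s_itemsize = s.dtype.itemsize

# On Python 3 the 'S' elements come back as bytes, not str,
# which is the interoperability gap under discussion.
first = s[0]
is_bytes = isinstance(first, bytes)

# An explicit decode step restores the original text.
roundtrip = np.char.decode(s, "latin-1")
same = bool((roundtrip == u).all())
```

The storage is four times smaller, but the user has to decode by hand; the open question is whether a dtype can do that step transparently.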

I think there is a technical problem. We may be able to masquerade latin-1 as
utf-8 for some subset of characters or fool python 3 in some other way.
But in any case, I think it needs some research to see what the
possibilities are.

Chuck