[Numpy-discussion] A one-byte string dtype?

Mon Jan 20 18:12:20 EST 2014

On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:

>
>
>
> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs at pobox.com> wrote:
>
>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>> >
>> >
>> >
>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
>> >>
>> >>
>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris at gmail.com> wrote:
>> >> >
>> >> > I think we may want something like PEP 393. The S datatype may be the
>> >> > wrong place to look, we might want a modification of U instead so as to
>> >> > transparently get the benefit of python strings.
>> >>
>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>> >> does for numpy arrays for two reasons: str is immutable and opaque.
>> >>
>> >> Since str is immutable the maximum code point in the string can be
>> >> determined once when the string is created before anything else can get a
>> >> pointer to the string buffer.
>> >>
>> >> Since it is opaque no one can rightly expect it to expose a particular
>> >> binary format so it is free to choose without compromising any expected
>> >> semantics.
>> >>
>> >> If someone can call buffer on an array then the FSR is a semantic change.
>> >>
>> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>> >> then it would have a one byte per char buffer. What then happens if you
>> >> put a higher code point in? The buffer needs to be resized and the data
>> >> copied over. But then what happens to any buffer objects or array views?
>> >> They would be pointing at the old buffer from before the resize.
>> >> Subsequent modifications to the resized array would not show up in other
>> >> views and vice versa.
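
To make the point about views concrete, here is a minimal sketch of how the
existing 'U' dtype behaves today. It does not implement the FSR; it only
shows that a view shares the array's buffer, which is the property a
resize-on-write scheme would have to preserve. The variable names are
illustrative only.

```python
import numpy as np

# 'U' stores every character as a fixed 4-byte code unit (UCS-4).
a = np.array(["abc", "xyz"], dtype="U3")
bytes_per_char = a.itemsize // 3  # 12 bytes per element, 3 characters each

# A second array object that looks at the very same memory.
codes = a.view(np.uint32)
shares_buffer = np.shares_memory(a, codes)

# Writing a non-ASCII code point through `a` is visible through `codes`,
# because the buffer is wide enough already and is never reallocated.
a[0] = "ab\u20ac"  # EURO SIGN, U+20AC
last_code_of_first = int(codes[2])
```

With a one-byte-per-character buffer, the assignment above could only succeed
by allocating a wider buffer, and `codes` would be left pointing at the old
one.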
>> >>
>> >> I don't think that this can be done transparently since users of a numpy
>> >> array need to know about the binary representation. That's why I suggest
>> >> a dtype that has an encoding. Only in that way can it consistently have
>> >> both a binary and a text interface.
>> >
>> >
>> > I didn't say we should change the S type, but that we should have
>> > something, say 's', that appeared to python as a string. I think if we
>> > want transparent string interoperability with python together with a
>> > compressed representation, and I think we need both, we are going to have
>> > to deal with the difficulties of utf-8. That means raising errors if the
>> > string doesn't fit in the allotted size, etc. Mind, this is a workaround
>> > for the mass of ascii data that is already out there, not a substitute
>> > for 'U'.
>>
>> If we're going to be taking that much trouble, I'd suggest going ahead
>> and adding a variable-length string type (where the array itself
>> contains a pointer to a lookaside buffer, maybe with an optimization
>> for stashing short strings directly). The fixed-length requirement is
>> pretty onerous for lots of applications (e.g., pandas always uses
>> dtype="O" for strings -- and that might be a good workaround for some
>> people in this thread for now). The use of a lookaside buffer would
>> also make it practical to resize the buffer when the maximum code
>> point changed, for that matter...
>>
>
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays, and I think they are implemented as pointers to
buffers. The easiest way for us to do that would be to specialize object
arrays to string types only, as you suggest.
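
For reference, a small sketch of the difference between a fixed-width 'U'
array and an object array holding str values, which is the existing
workaround mentioned above. This only shows current behaviour; a string-only
specialization of object arrays does not exist, and the variable names are
illustrative.

```python
import numpy as np

words = ["a", "numpy", "a considerably longer string"]

fixed = np.array(words, dtype="U")      # width taken from the longest entry
ragged = np.array(words, dtype=object)  # each element points to a Python str

fixed_bytes_per_item = fixed.itemsize     # 4 bytes * length of longest string
pointer_bytes_per_item = ragged.itemsize  # one pointer per element

# Assigning a longer string: the fixed-width array silently truncates it...
fixed[0] = "x" * 100
# ...whereas the object array just points at the new, longer str.
ragged[0] = "x" * 100
```

The object array pays for this flexibility with one Python object per
element, which is the overhead a dedicated variable-length string type would
presumably try to reduce.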

<snip>

Chuck