[Numpy-discussion] A one-byte string dtype?

Mon Jan 20 17:28:09 EST 2014

On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:

>
> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris at gmail.com>
> wrote:
> >
> > I think we may want something like PEP 393. The S datatype may be the
> > wrong place to look, we might want a modification of U instead so as to
> > transparently get the benefit of Python strings.
>
> The approach taken in PEP 393 (the FSR) makes more sense for str than it
> does for numpy arrays for two reasons: str is immutable and opaque.
>
> Since str is immutable, the maximum code point in the string can be
> determined once, when the string is created, before anything else can get
> a pointer to the string buffer.
>
> Since it is opaque, no one can rightly expect it to expose a particular
> binary format, so it is free to choose without compromising any expected
> semantics.
>
> If someone can call buffer() on an array, then the FSR is a semantic change.
>
> If a numpy 'U' array used the FSR and consisted only of ASCII characters
> then it would have a one byte per char buffer. What then happens if you put
> a higher code point in? The buffer needs to be resized and the data copied
> over. But then what happens to any buffer objects or array views? They
> would be pointing at the old buffer from before the resize. Subsequent
> modifications to the resized array would not show up in other views and
> vice versa.
>
> I don't think that this can be done transparently since users of a numpy
> array need to know about the binary representation. That's why I suggest a
> dtype that has an encoding. Only in that way can it consistently have both
> a binary and a text interface.
>
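
The view problem described above can be sketched with today's dtypes. This
is only an illustration, not a proposed implementation: NumPy has no
FSR-style 'U' array, so the "resize" is simulated here with astype(), which
allocates a new buffer, and the variable names are made up for the example.

```python
import numpy as np

# A fixed-width 'U' array stores every code point in 4 bytes (UCS-4),
# so three characters per item take 12 bytes.
a = np.array(["abc", "xyz"], dtype="U3")
itemsize = a.dtype.itemsize

# A view shares the array's buffer, so a write through the array is
# visible through the view.
v = a.view()
a[0] = "\u03b1bc"            # Greek alpha followed by "bc"
view_sees_write = (v[0] == a[0])

# Simulated FSR-style widening (an assumption for illustration): start
# from a one-byte-per-char buffer, then "resize" by copying to 'U'.
one_byte = np.array([b"abc", b"xyz"], dtype="S3")
old_view = one_byte.view()
widened = one_byte.astype("U3")   # new buffer, not shared with old_view
widened[0] = "\u03b1bc"

# The old view still points at the original buffer and never sees the
# non-ASCII write made after the widening.
stale_item = old_view[0]
still_shared = np.shares_memory(one_byte, widened)
```

After the copy, old_view and widened no longer share memory, which is the
situation any existing buffer object or view would be in after a
transparent resize.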

I didn't say we should change the S type, but that we should have
something, say 's', that appears to Python as a string. If we want
transparent string interoperability with Python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of UTF-8. That means raising errors if the encoded string
doesn't fit in the allotted size, etc. Mind, this is a workaround for the
mass of ASCII data that is already out there, not a substitute for 'U'.
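
A rough sketch of what "raising errors if the string doesn't fit" could look
like, assuming a fixed number of bytes per item and UTF-8 as the encoding.
The helpers encode_fixed and decode_fixed are hypothetical names for this
example, not NumPy API, and the existing 'S' dtype stands in for the storage.

```python
import numpy as np


def encode_fixed(text, nbytes):
    """Encode text as UTF-8, refusing anything longer than nbytes bytes."""
    raw = text.encode("utf-8")
    if len(raw) > nbytes:
        raise ValueError(
            "%r needs %d bytes as UTF-8, but only %d are allotted"
            % (text, len(raw), nbytes)
        )
    return raw


def decode_fixed(raw):
    """Decode one stored item back into a Python str."""
    return bytes(raw).decode("utf-8")


# Four bytes per item: enough for four ASCII characters, fewer otherwise.
arr = np.zeros(2, dtype="S4")
arr[0] = encode_fixed("abcd", 4)       # 4 characters, 4 bytes
arr[1] = encode_fixed("n\u00e9", 4)    # 2 characters, 3 bytes

first = decode_fixed(arr[0])
second = decode_fixed(arr[1])

# Three characters that need six bytes do not fit in a four-byte item.
try:
    encode_fixed("\u00e9\u00e9\u00e9", 4)
    overflow_raised = False
except ValueError:
    overflow_raised = True
```

The point of the sketch is that the item size counts bytes, not characters,
so whether a string fits depends on its content and the check has to happen
at assignment time.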

Chuck