[Numpy-discussion] String type again.

Wed Jul 16 10:01:45 EDT 2014

On Tue, Jul 15, 2014 at 11:15 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:

>
> On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <
> sebastian at sipsolutions.net> wrote:
>
>> On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
>> > As previous posts have pointed out, Numpy's `S` type is currently
>> > treated as a byte string, which leads to more complicated code in
>> > python3. OTOH, the unicode type is stored as UCS4, which consumes a
>> > lot of space, especially for ascii strings. This note proposes to
>> > adapt the currently existing 'a' type letter, currently aliased to
>> > 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte
>> > internal representations for unicode strings, ascii and latin1. Ascii
>> > has the advantage that it is a subset of UTF-8, whereas latin1 has a
>> > few more symbols. Another possibility is to just make it a UTF-8
>> > encoding, but I think this would involve more overhead as Python would
>> > need to determine the maximum character size. These are just
>> > preliminary thoughts, comments are welcome.
>> >
>>
>> Just wondering, couldn't we have a type which actually has an
>> (arbitrary, python supported) encoding (and "bytes" might even just be a
>> special case of no encoding)? Basically storing bytes and on access do
>> element[i].decode(specified_encoding) and on storing element[i] =
>> value.encode(specified_encoding).
>>
>> There is always the never ending small issue of trailing null bytes. If
>> we want to be fully compatible, such a type would have to store the
>> string length explicitly to support trailing null bytes.
>>
>
> UTF-8 encoding works with null bytes. That is one of the reasons it is so
> popular.
>
> Thinking more about it, the easiest thing to do might be to make the S
> dtype a UTF-8 encoding. Most of the machinery to deal with that is already
> in place. That change might affect some users though, and we might need to
> do some work to make it backwards compatible with python 2.
>
> Chuck
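
To make the quoted points concrete, here is a rough plain-Python sketch of
Sebastian's encode-on-store / decode-on-access idea, including the trailing
null issue he mentions. The class name EncodedFixedWidth and its layout are
hypothetical, made up for illustration; this is not numpy code or a proposed
implementation.

```python
class EncodedFixedWidth:
    """Hypothetical fixed-width string store: bytes inside, str outside."""

    def __init__(self, n_items, width, encoding="utf-8"):
        self.width = width
        self.encoding = encoding
        # One flat, zero-filled buffer with `width` bytes per element.
        self._buf = bytearray(n_items * width)

    def __setitem__(self, i, value):
        raw = value.encode(self.encoding)
        if len(raw) > self.width:
            raise ValueError("encoded value does not fit in the item width")
        start = i * self.width
        # Pad with null bytes up to the fixed width.
        self._buf[start:start + self.width] = raw.ljust(self.width, b"\x00")

    def __getitem__(self, i):
        start = i * self.width
        raw = bytes(self._buf[start:start + self.width])
        # Without a stored length, padding and genuine trailing nulls
        # cannot be told apart, so both are stripped here.
        return raw.rstrip(b"\x00").decode(self.encoding)


store = EncodedFixedWidth(n_items=2, width=6, encoding="utf-8")
store[0] = "naïve"     # 6 bytes once encoded as UTF-8
store[1] = "ab\x00"    # ends with a genuine null character
first, second = store[0], store[1]
```

In this sketch the first element round-trips, while the second comes back
without its trailing null, which is the compatibility problem an explicit
stored length would avoid.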

Are you saying that numpy S dtypes would be exported to Py3 as str?  This
would work in my use case, though it seems it would break things for the
(few-ish) people using the numpy S type in Py3, since elements would now
look like a Python str instead of a bytes object.
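
For reference, a small check of the current behavior I have in mind,
assuming Python 3 and an installed numpy (the variable names are just for
illustration):

```python
import numpy as np

arr = np.array(["abc", "de"], dtype="S3")   # ASCII str input is stored as bytes
item = arr[0]

is_bytes = isinstance(item, bytes)          # elements come back as bytes on Py3
as_text = item.decode("ascii")              # an explicit decode is needed for str

# Per-element storage: 1 byte per character for S, 4 bytes (UCS4) for U.
s_size = np.dtype("S3").itemsize
u_size = np.dtype("U3").itemsize
```

Code that relies on the bytes result today is what would be affected if S
elements started coming back as str.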

One other thought is that one *might* finesse the fixed width vs. utf-8
variable length issue by using the exact same rules that currently apply to
strings in Py2:

- When setting an array from input like a list of strings (unicode in Py3),
make the array wide enough to handle the widest (in bytes) entry.
- When setting an element in an existing array, truncate any characters
that don't fit in the existing width.

In the second point, note that truncation would drop whole Unicode
characters, not individual bytes.  This could be a point of confusion in some
cases, but it's simple to implement and formally consistent with current behavior.
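
The two rules above could be sketched in plain Python roughly like this. The
function names required_width and truncate_to_width are hypothetical, and
"character" is taken to mean a single code point, which is an assumption on
my part.

```python
def required_width(strings, encoding="utf-8"):
    """Width in bytes of the widest encoded entry (first rule)."""
    return max(len(s.encode(encoding)) for s in strings)


def truncate_to_width(value, width, encoding="utf-8"):
    """Keep as many leading whole characters as fit in `width` bytes (second rule)."""
    out = bytearray()
    for ch in value:
        raw = ch.encode(encoding)
        if len(out) + len(raw) > width:
            break  # stop at the first character that does not fit
        out += raw
    return out.decode(encoding)


width = required_width(["abc", "naïve"])   # "naïve" needs 6 bytes in UTF-8
kept = truncate_to_width("añb", 2)         # "ñ" needs 2 bytes, so only "a" fits
```

Because truncation stops at a character boundary, the stored bytes always
decode cleanly, at the cost of sometimes leaving a byte or two unused.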

- Tom

p.s. Strangely enough, the mail I quoted from Chuck beginning with "Thinking
more about it ..." never got to my email; I only happened to see it in the
archives.
