[Numpy-discussion] String type again.

Tue Jul 15 11:15:17 EDT 2014

On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebastian at sipsolutions.net>
wrote:

> On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
> > As previous posts have pointed out, Numpy's `S` type is currently
> > treated as a byte string, which leads to more complicated code in
> > python3. OTOH, the unicode type is stored as UCS4, which consumes a
> > lot of space, especially for ascii strings. This note proposes to
> > adapt the currently existing 'a' type letter, currently aliased to
> > 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte
> > internal representations for unicode strings, ascii and latin1. Ascii
> > has the advantage that it is a subset of UTF-8, whereas latin1 has a
> > few more symbols. Another possibility is to just make it a UTF-8
> > encoding, but I think this would involve more overhead as Python would
> > need to determine the maximum character size. These are just
> > preliminary thoughts, comments are welcome.
> >
>
> Just wondering, couldn't we have a type which actually has an
> (arbitrary, python supported) encoding (and "bytes" might even just be a
> special case of no encoding)? Basically storing bytes and on access do
> element[i].decode(specified_encoding) and on storing element[i] =
> value.encode(specified_encoding).
>
> There is always the never ending small issue of trailing null bytes. If
> we want to be fully compatible, such a type would have to store the
> string length explicitly to support trailing null bytes.
>
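
To make the encode-on-store, decode-on-access idea concrete, here is a
minimal pure-Python sketch. The class name `EncodedFixedWidth` and its
parameters are hypothetical, not an existing Numpy API; it only assumes
null-padded, fixed-width items and strips the padding on access.

```python
class EncodedFixedWidth:
    """Hypothetical fixed-width string storage with one encoding per array."""

    def __init__(self, n_items, itemsize, encoding):
        self.itemsize = itemsize
        self.encoding = encoding
        self.buf = bytearray(n_items * itemsize)  # starts out null-filled

    def __setitem__(self, i, value):
        # On store: encode, then pad with null bytes up to the item size.
        raw = value.encode(self.encoding)
        if len(raw) > self.itemsize:
            raise ValueError("encoded value does not fit in item")
        start = i * self.itemsize
        self.buf[start:start + self.itemsize] = raw.ljust(self.itemsize, b"\x00")

    def __getitem__(self, i):
        # On access: drop the null padding, then decode.
        start = i * self.itemsize
        raw = bytes(self.buf[start:start + self.itemsize])
        return raw.rstrip(b"\x00").decode(self.encoding)


arr = EncodedFixedWidth(n_items=2, itemsize=8, encoding="latin-1")
arr[0] = "café"
arr[1] = "ab\x00"          # value with a genuine trailing null
roundtrip_ok = arr[0]      # comes back unchanged
lost_null = arr[1]         # trailing null is indistinguishable from padding
```

The second item shows the trailing null issue: without a stored length,
a real trailing null cannot be told apart from padding.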

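UTF-8 encoding works with null bytes: no code point other than U+0000
encodes to a zero byte, so null-terminated or null-padded storage never
cuts a string short. That is one of the reasons it is so popular.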
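
A quick way to check the null-byte property, using only the standard
library. The variable names are just for illustration, and the fixed
width of 8 bytes is an arbitrary assumption.

```python
# Every code point except U+0000 (surrogates cannot be encoded, so skip
# them) should encode to UTF-8 without any zero byte.
offenders = [
    cp
    for cp in range(1, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF and 0 in chr(cp).encode("utf-8")
]

# For contrast, a fixed-width encoding such as UTF-32 embeds zero bytes
# even for plain ASCII text.
utf32_a = "a".encode("utf-32-le")

# Null padding to a fixed width therefore round-trips under UTF-8.
padded = "naïve".encode("utf-8").ljust(8, b"\x00")
restored = padded.rstrip(b"\x00").decode("utf-8")
```

This says nothing about strings that themselves end in U+0000; those
still need an explicit length to survive padding.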

Chuck