[Numpy-discussion] proposal: smaller representation of string arrays

Wed Apr 26 01:19:22 EDT 2017

On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.kern at gmail.com> wrote:

> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
> > The maximum length of a UTF-8 character is 4 bytes, so we could use
> > that to size arrays by character length. The advantage over UTF-32 is that
> > it is easily compressible, probably by a factor of 4 in many cases. That
> > doesn't solve the in-memory problem, but does have some advantages on disk
> > as well as making for easy display. We could compress it ourselves after
> > encoding by truncation.
>
> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
>

It's not just HDF5. Counting bytes is the Right Way to measure the size of
UTF-8 encoded text:
http://utf8everywhere.org/#myths
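
To make the distinction concrete, here is a minimal Python sketch of byte
width versus code point count. The sample strings and the helper names
`utf8_width` and `fits_in_field` are made up for illustration; they are not
part of NumPy or HDF5, and the byte-sized field is only an assumed model of a
fixed-width storage slot.

```python
# Illustrative sketch only: helper names and sample data are hypothetical.

def utf8_width(text: str) -> int:
    """Return the number of bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_field(text: str, nbytes: int) -> bool:
    """Check whether `text` fits in a field whose width is given in bytes."""
    return utf8_width(text) <= nbytes


samples = [
    "numpy",             # ASCII only: 1 byte per code point
    "na\u00efve",        # one 2-byte code point (U+00EF)
    "\u65e5\u672c\u8a9e",  # three 3-byte code points
    "\U0001f600",        # one 4-byte code point
]

for s in samples:
    # len(s) counts code points; the encoded lengths count bytes.
    print(
        f"{s!r}: {len(s)} code points, "
        f"{utf8_width(s)} UTF-8 bytes, "
        f"{len(s.encode('utf-32-le'))} UTF-32 bytes"
    )
```

A field sized as "4 bytes per character" always has room, but it reserves as
much space as UTF-32 would. A field sized in bytes stores exactly what the
encoded text needs, which is why the byte count is the number that matters
for storage.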

I also firmly believe (though clearly this is not universally agreed upon)
that UTF-8 is the Right Way to encode strings for *non-legacy*
applications. So if we're adding any new string encodings, UTF-8 needs to be
one of them.