[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Wed Apr 26 13:45:20 EDT 2017


On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald <peridot.faceted at gmail.com>
wrote:
>
> On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer <shoyer at gmail.com> wrote:
>>
>> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.kern at gmail.com> wrote:
>>>
>>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:
>>>
>>> > The maximum length of a UTF-8 character is 4 bytes, so we could use
>>> > that to size arrays by character length. The advantage over UTF-32 is
>>> > that it is easily compressible, probably by a factor of 4 in many
>>> > cases. That doesn't solve the in-memory problem, but does have some
>>> > advantages on disk as well as making for easy display. We could
>>> > compress it ourselves after encoding by truncation.
>>>
>>> The major use case that we have for a UTF-8 array is HDF5, and it
>>> specifies the width in bytes, not Unicode characters.
>>
>> It's not just HDF5. Counting bytes is the Right Way to measure the size
>> of UTF-8 encoded text:
>> http://utf8everywhere.org/#myths
>>
>> I also firmly believe (though clearly this is not universally agreed
>> upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
>> applications. So if we're adding any new string encodings, UTF-8 needs
>> to be one of them.
>
> It seems to me that most of the requirements people have expressed in
> this thread would be satisfied by:
>
> (1) object arrays of strings. (We have these already; whether a
> strings-only specialization would permit useful things like
> string-oriented ufuncs is a question for someone who's willing to
> implement one.)
>
> (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All
> Python encodings should be permitted. An additional function to truncate
> encoded data without mangling the encoding would be handy. I think it makes
> more sense for this to be NULL-padded than NULL-terminated, but it may be
> necessary to support both; note that NULL-termination is complicated for
> encodings like UCS4. This also includes the legacy UCS4 strings as a
> special case.
>
> (3) a dtype for fixed-length byte strings. This doesn't look very
> different from an array of dtype uint8, but given we have the bytes type,
> accessing the data this way makes sense.

The void dtype is already there for this general purpose and mostly works,
with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the
scalars (fortunately, they do not appear to be mutable views). It also
silently accepts `bytes` strings that are too short (it pads them with NULs)
and too long (it truncates them). If it worked more transparently, and
perhaps more rigorously, with `bytes`, then it would be quite suitable.
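
To make the "truncate encoded data without mangling the encoding" helper
from item (2) above concrete, here is a rough, stdlib-only sketch. The names
`encode_fixed` and `decode_fixed` are hypothetical, not existing NumPy API,
and I have only thought this through for UTF-8; a multi-byte code unit
encoding such as UTF-32 would additionally need the width to be a multiple
of the code unit size.

```python
def encode_fixed(text, width, encoding="utf-8"):
    """Encode `text` into exactly `width` bytes, NUL-padded on the right.

    Hypothetical helper: if the encoded text is too long, it is cut at a
    character boundary instead of in the middle of a multi-byte sequence.
    """
    data = text.encode(encoding)
    if len(data) > width:
        # Cut at the byte limit, then drop any incomplete trailing
        # character by round-tripping through a lenient decode.
        data = data[:width].decode(encoding, errors="ignore").encode(encoding)
    return data.ljust(width, b"\x00")


def decode_fixed(data, encoding="utf-8"):
    """Decode a NUL-padded field, stripping the padding after decoding.

    Stripping after decoding (rather than stripping raw NUL bytes) avoids
    eating zero bytes that belong to a character in wider encodings.
    Text that legitimately ends in U+0000 loses those characters.
    """
    return data.decode(encoding).rstrip("\x00")


print(encode_fixed("héllo", 8))                 # b'h\xc3\xa9llo\x00\x00'
print(encode_fixed("héllo", 2))                 # b'h\x00'
print(decode_fixed(encode_fixed("日本語", 7)))  # 日本
```

Note that the second call keeps only one character in a two-byte field,
because the two-byte "é" does not fit after the "h". That is the cost of
sizing by bytes rather than by characters.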

--
Robert Kern