[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Thu Apr 20 15:46:21 EDT 2017


On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald <peridot.faceted at gmail.com> wrote:
>
> On Thu, Apr 20, 2017 at 8:55 PM Robert Kern <robert.kern at gmail.com> wrote:

>> For example, to my understanding, FITS files more or less follow numpy
>> assumptions for their string columns (i.e. uniform-length). But the format
>> enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this
>> was the singular motivating use case for the trailing-NULL behavior of
>> np.string.
>
> Actually if I understood the spec, FITS header lines are 80 bytes long
> and contain ASCII with no NULLs; strings are quoted and trailing spaces are
> stripped.

Never mind, then. :-)
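
(For concreteness, the trailing-NULL behavior I'm referring to is what the
current 'S' dtype already does:

    >>> import numpy as np
    >>> a = np.array([b'abc'], dtype='S5')
    >>> a.tobytes()     # stored padded out to the itemsize with NULLs
    b'abc\x00\x00'
    >>> a[0]            # trailing NULLs are stripped on item access
    b'abc'
    >>> np.array([b'a\x00b'], dtype='S5')[0]  # embedded NULLs survive
    b'a\x00b'

so round-tripping arbitrary bytes through it is lossy at the tail.)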

>> If I had to jump ahead and propose new dtypes, I might suggest this:
>>
>> * For the most part, treat the string dtypes as temporary communication
>> formats rather than the preferred in-memory working format, similar to how
>> we use `float16` to communicate with GPU APIs.
>>
>> * Acknowledge the use cases of the current NULL-terminated np.string
>> dtype, but perhaps add a new canonical alias, document it as being for
>> those specific use cases, and deprecate/de-emphasize the current name.
>>
>> * Add a dtype for holding uniform-length `bytes` strings. This would be
>> similar to the current `void` dtype, but work more transparently with the
>> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
>> like `float64` does with `float`. This would not be NULL-terminated. No
>> encoding would be implied.
>
> How would this differ from a numpy array of bytes with one more
> dimension?

The differences: the scalar in the implementation would be the scalar in the
use case; the scalar would be immutable; and b'' strings would pass directly
in and out (and thus work easily with the Python codecs).
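
To see the difference concretely, here is what the extra-dimension workaround
looks like today (plain numpy, nothing hypothetical):

    >>> import numpy as np
    >>> raw = np.frombuffer(b'spameggs', dtype=np.uint8).reshape(2, 4)
    >>> raw[0]              # a length-4 uint8 array, not a bytes scalar
    array([115, 112,  97, 109], dtype=uint8)
    >>> raw[0].tobytes()    # an explicit conversion step every time
    b'spam'

With the proposed dtype, indexing would hand you a bytes-like scalar directly.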

>> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
>> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
>> This maintains all of the flexibility of using a `dtype=object` array while
>> allowing code to specialize for working with strings without all kinds of
>> checking on every item. But most importantly, we can serialize such an
>> array to bytes without having to use pickle. Utility functions could be
>> written for en-/decoding to/from the uniform-length bytestring arrays,
>> handling different encodings and things like NULL-termination (also working
>> with the legacy dtypes and handling structured arrays easily, etc.).
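
To sketch what I mean by those utility functions (hypothetical code written
against today's dtypes; the names and signatures are made up):

    import numpy as np

    def encode_fixed(arr, encoding='utf-8'):
        # Encode an object array of str into a uniform-width 'S' array;
        # numpy pads the shorter items with trailing NULLs.
        encoded = [s.encode(encoding) for s in arr]
        width = max(len(b) for b in encoded) if encoded else 1
        return np.array(encoded, dtype='S%d' % width)

    def decode_fixed(arr, encoding='utf-8'):
        # Inverse: the 'S' dtype strips the trailing NULLs on item
        # access, so each element decodes cleanly.
        return np.array([b.decode(encoding) for b in arr], dtype=object)
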
>
> I think there may also be a niche for fixed-byte-size null-terminated
> strings of uniform encoding, that do decoding and encoding automatically.
> The encoding would naturally be attached to the dtype, and they would
> handle too-long strings by either truncating to a valid encoding or simply
> raising an exception. As with the current fixed-length strings, they'd
> mostly be for communication with other code, so the necessity depends on
> whether such other codes exist at all. Databases, perhaps? Custom hunks of
> C that don't want to deal with variable-length packing of data? Actually
> this last seems plausible - if I want to pass a great wodge of data,
> including Unicode strings, to a C program, writing out a numpy array seems
> maybe the easiest.

HDF5 seems to support this, but only for ASCII and UTF-8, not a large list
of encodings.
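
For what it's worth, the "truncate to a valid encoding" option is at least
straightforward for UTF-8; a minimal sketch (the helper name is made up):

    def truncate_utf8(s, nbytes):
        # Encode s as UTF-8 and cut to at most nbytes without splitting
        # a multi-byte character: since the input was valid UTF-8, only
        # an incomplete trailing sequence can be invalid, and
        # errors='ignore' drops exactly that.
        b = s.encode('utf-8')[:nbytes]
        return b.decode('utf-8', errors='ignore').encode('utf-8')

    >>> truncate_utf8(u'na\xefve', 3)  # u'naive' with 2-byte '\xef'
    b'na'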

--
Robert Kern