[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 20 15:51:57 EDT 2017

On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> >
> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern <robert.kern at gmail.com> wrote:
> >>
> >> I don't know of a format off-hand that works with numpy uniform-length
> >> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> >> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> >> UTF8 strings.
> >
> >
> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed
> > and variable length versions:
> > https://github.com/PyTables/PyTables/issues/499
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
> >
> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
> > storage, not the number of characters.
>
> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
> of course are also constrained by numpy's current set of dtypes. The
> NULL-terminated ASCII works well enough with np.string's semantics.
>

Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should
correspond to a text (str) type, not np.string_ (which is really bytes).
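
A minimal sketch of the distinction, assuming Python 3 and a reasonably
recent NumPy. The variable names are illustrative only, and the "S5"/"U5"
dtypes are stand-ins for np.string_ and the unicode dtype. It shows that
fixed-width "S" elements come back as bytes rather than str, and that a
byte count and a character count diverge once UTF-8 is involved:

```python
import numpy as np

# An "S" dtype holds fixed-width raw bytes; on Python 3 its elements
# are bytes, not str.
fixed_bytes = np.array([b"abc", b"de"], dtype="S5")
print(type(fixed_bytes[0]), fixed_bytes.dtype.itemsize)

# A "U" dtype holds text (str), but spends 4 bytes per character.
fixed_text = np.array(["abc", "de"], dtype="U5")
print(type(fixed_text[0]), fixed_text.dtype.itemsize)

# Treating the stored bytes as text takes an explicit decode step.
decoded = np.char.decode(fixed_bytes, "ascii")

# For UTF-8, a fixed length counted in bytes is not a character count:
# the one non-ASCII character below encodes to two bytes.
word = "na\u00efve"
n_chars = len(word)
n_bytes = len(word.encode("utf-8"))
print(n_chars, n_bytes)
```

So a "fixed length ASCII" field read into an "S" array still needs that
decode step before it behaves like a Python 3 string.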