[Numpy-discussion] String type again.

Wed Jul 16 16:51:39 EDT 2014

> But HDF5
> additionally has a fixed-storage-width UTF8 type, so we could map to a
> NumPy fixed-storage-width type trivially.

Sure -- this is why *nix uses utf-8 for filenames -- it can just be a
char*. But that just punts the problem to client code.

I think a UTF-8 string type does not match the numpy model well, and I
don't think we should support it just because it would be easier for
the HDF5 wrappers.

(To be fair, there are probably other similar systems numpy wants to
interface with that could use this...)

It seems that if you want a 1:1 binary mapping between HDF and numpy for
UTF-8 strings, then a bytes type in numpy makes more sense. Numpy
could/should have encode and decode methods for converting byte arrays
to/from Unicode arrays (does it already?).
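
A minimal sketch of that byte-array/Unicode-array round trip, using the
element-wise np.char.encode and np.char.decode functions. The sample
strings are made up for illustration, and this says nothing about how
HDF5 or h5py would store the result:

```python
import numpy as np

# A Unicode ('U') array containing some non-ASCII text.
u = np.array(["degree °", "naïve"])

# Encode each element to UTF-8; the result is a bytes ('S') array.
b = np.char.encode(u, "utf-8")

# Decode the bytes back into a Unicode ('U') array.
u2 = np.char.decode(b, "utf-8")
```

The bytes array is what would map 1:1 onto a fixed-width byte field; the
Unicode array is what client code would actually want to work with.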

> "Custom" in this context means a user-created HDF5 data-conversion
> filter, which is necessary since all data conversion is handled inside
> the HDF5 library.

> As far as generic Unicode goes, we currently don't support the NumPy
> "U" dtype in h5py for similar reasons; there's no destination type in
> HDF5 which (1) would preserve the dtype for round-trip write/read
> operations and (2) doesn't risk truncation.

It sounds to me like HDF5 simply doesn't support Unicode. Calling an
array of bytes utf-8 simply pushes the problem on to client libs. As
that's where the problem lies, PyHDF may be the place to
address it.

If we put utf-8 in numpy, we have the truncation problem there instead
-- which is exactly what I think we should avoid.
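
To make the truncation problem concrete, here is a small sketch in plain
Python. The helper name fit_utf8 and the 4-byte width are hypothetical,
chosen only to show what happens when a fixed storage width cuts a
multi-byte character in half:

```python
def fit_utf8(text, nbytes):
    """Encode to UTF-8 and cut to a fixed storage width in bytes."""
    return text.encode("utf-8")[:nbytes]

# The degree sign takes two bytes in UTF-8 (0xC2 0xB0), so a 4-byte
# field ends in the middle of it.
raw = fit_utf8("25 °C", 4)

try:
    raw.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    # The stored bytes are no longer valid UTF-8.
    decodes_cleanly = False
```

A fixed byte width therefore does not correspond to a fixed number of
characters, and a careless cut leaves data that cannot be decoded.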

> A Latin-1 based 'a' type
> would have similar problems.

Maybe not -- latin1 is fixed width.
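
A quick sketch of what "fixed width" buys here, in plain Python with
made-up sample strings: every Latin-1 character is exactly one byte, so
byte width equals character count and a cut at any byte boundary still
decodes. The obvious caveat is that characters outside Latin-1 cannot be
encoded at all.

```python
texts = ["25 °C", "naïve", "plain ascii"]

# Characters per string, and bytes per string in each encoding.
char_counts = [len(t) for t in texts]
latin1_widths = [len(t.encode("latin-1")) for t in texts]
utf8_widths = [len(t.encode("utf-8")) for t in texts]

# Truncating Latin-1 bytes always yields the matching prefix of the text.
truncated = "25 °C".encode("latin-1")[:4].decode("latin-1")
```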

>> Does HDF enforce ascii-only? what does it do with the > 127 values?
>
> Unfortunately/fortunately the charset is not enforced for either ASCII

So you can dump Latin-1 into and out of the HDF 'ASCII' type -- it's
essentially the old char* / py2 string. An ugly situation, but why not
use it?

> or UTF-8,

So ASCII and utf-8 are really the same thing, with different meta-data...
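
As an illustration of "same bytes, different meta-data", a sketch using
a numpy bytes ('S') array rather than HDF5 itself. The sample strings
are invented; the point is only that the container stores whatever
bytes it is given and the encoding is a convention the reader has to
know:

```python
import numpy as np

# For pure-ASCII text the two encodings produce identical bytes.
same_bytes = "temperature".encode("ascii") == "temperature".encode("utf-8")

# A bytes array accepts Latin-1 and UTF-8 data alike, with no check.
a = np.array(["25 °C".encode("latin-1"), "25 °C".encode("utf-8")])

as_latin1 = a[0].decode("latin-1")   # decoded with the matching encoding
as_utf8 = a[1].decode("utf-8")       # decoded with the matching encoding
mojibake = a[1].decode("latin-1")    # UTF-8 bytes misread as Latin-1
```

Misreading the UTF-8 element as Latin-1 does not raise an error; it
quietly produces the wrong text, which is the ugly part of leaving the
charset unenforced.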

> although the HDF Group has been thinking about it.

I wonder if they would consider going Latin-1 instead of ASCII --
similarly to utf-8 it's backward compatible with ASCII, but gives you
a little more.

I don't know that there is another 1-byte encoding worth using -- it
may be my English bias, but it seems Latin-1 gives us ASCII plus some
extra stuff handy for science (I use the degree symbol a lot, for
instance) with nothing lost.

> Ideally, NumPy would support variable-length
> strings, in which case all these headaches would go away.

Would they? That would push the problem back to PyHDF -- which I'm
arguing is where it belongs, but I didn't think you were ;-)
>
> But I
> imagine that's also somewhat complicated. :)

That's a whole other kettle of fish, yes.

-Chris