[Numpy-discussion] String type again.

Chris Barker chris.barker at noaa.gov
Fri Jul 18 12:54:03 EDT 2014


On Fri, Jul 18, 2014 at 9:32 AM, Andrew Collette <andrew.collette at gmail.com>
wrote:

> >> A Latin-1 based 'a' type
> >> would have similar problems.
> >
> > Maybe not -- latin1 is fixed width.
>
> Yes, Latin-1 is fixed width, but the issue is that when writing to a
> fixed-width UTF8 string in HDF5, it will expand, possibly losing data.
>

You shouldn't do that -- I was in no way suggesting that a latin-1 string
get pushed to a utf-8 array by default; that would be a bad idea. utf-8 is
a unicode encoding -- it should be used for unicode.

As for truncation -- that's inherent in using a fixed-width array to store
a variable-width encoding.
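For a quick illustration of the expansion problem (the strings here are just made-up examples):

```python
# A 10-character string containing non-ASCII Latin-1 characters.
s = "résumé-doc"                  # 10 characters
assert len(s) == 10

latin1 = s.encode("latin-1")      # fixed width: one byte per character
utf8 = s.encode("utf-8")          # variable width: each é takes 2 bytes

print(len(latin1))                # 10 bytes -- fits a 10-byte field exactly
print(len(utf8))                  # 12 bytes -- overflows a 10-byte field
```

So a string that fits a 10-byte latin-1 field exactly can overflow a 10-byte utf-8 field.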

What I would like to avoid is a situation where a user writes a
> 10-byte string from NumPy into a 10-byte space in an HDF5 dataset, and
> unexpectedly loses the last few characters because of the encoding
> mismatch.
>

Again, they shouldn't do that -- they should be pushing a 10-character
string into something, and utf-8 is going to (possibly) truncate it. That's
an HDF/utf-8 limitation that people are going to have to deal with. I think
you're suggesting that numpy follow the HDF model, so that the numpy-HDF
transition can be clean and easy. However, I think that utf-8 is an
inappropriate model for numpy, and that the mess of mapping bytes to utf-8
is pyHDF's problem, not numpy's.

i.e. your issue above -- should users put a 10-character string into a numpy
10-byte utf-8 type and see it truncated? That's what I want to avoid.
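Since numpy has no utf-8 dtype, the failure mode can be sketched with the one-byte 'S' dtype standing in for a hypothetical 10-byte utf-8 field (the string is a made-up example):

```python
import numpy as np

s = "résumé-doc"                  # 10 characters, but 12 bytes as utf-8
a = np.zeros(1, dtype="S10")      # a fixed 10-byte field
a[0] = s.encode("utf-8")          # numpy silently truncates to 10 bytes

stored = bytes(a[0])
print(len(stored))                # 10 -- the trailing two bytes were dropped
```

The truncation is silent -- exactly the kind of surprise a 10-byte utf-8 dtype would hand users who think in characters.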

In any case, I certainly agree NumPy shouldn't be limited by the
> capabilities of HDF5.  There are other valuable use cases, including
> access to the high-bit characters Latin-1 provides.  But from a strict
> compatibility standpoint, ASCII would be beneficial.
>

This is where I wonder about HDF's "ascii" type -- is it really ascii? Or
is it that old standby, the
one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around
type? i.e. the old char* ?

In which case, you can just push a latin-1 type into and out of your HDF
ascii arrays and everything will work just fine. Unless someone stores
something other than latin-1 or ascii in it -- but even then, the bytes
would still be preserved.
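If the "ascii" type really is just a char*-style byte container, the round trip is lossless -- a sketch, using numpy's 'S' dtype to stand in for such a container (the string is a made-up example):

```python
import numpy as np

s = "café Ångström"               # Latin-1-encodable text
raw = s.encode("latin-1")

a = np.array([raw], dtype="S32")  # a one-byte-per-char container (char*)
roundtrip = bytes(a[0]).decode("latin-1")

print(roundtrip == s)             # True -- every byte preserved exactly
```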

This is why I see no downside to latin-1 -- if you don't use the code
points above 127, it's the same thing; if you do, you get some extra handy
characters. The only difference is that a proper ascii type would not let
you store anything above 127 at all -- why restrict ourselves?
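The "no downside" point can be checked directly: latin-1 accepts every byte value and agrees with ascii on the first 128 code points, while a strict ascii codec rejects anything above 127:

```python
# Every byte value 0-255 decodes under latin-1 -- it never rejects data.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")            # never raises
assert text.encode("latin-1") == all_bytes    # lossless round trip

# A strict ascii decode refuses anything above 127:
try:
    all_bytes.decode("ascii")
except UnicodeDecodeError:
    print("ascii refuses bytes > 127")
```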

And if you want utf-8 in HDF, then use a unicode array, knowing that some
truncation could occur, or use a byte array and do the encoding yourself,
so the user knows exactly what they are doing.
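"Do the encoding yourself" might look like the following sketch -- `encode_checked` is a hypothetical helper, not a numpy or h5py API, shown here just to make the explicit-encoding idea concrete:

```python
import numpy as np

def encode_checked(strings, nbytes, encoding="utf-8"):
    """Encode strings into a fixed-width byte array, raising on
    overflow instead of silently truncating. (Illustrative helper.)"""
    out = np.zeros(len(strings), dtype="S%d" % nbytes)
    for i, s in enumerate(strings):
        b = s.encode(encoding)
        if len(b) > nbytes:
            raise ValueError(
                "%r needs %d bytes, field is %d" % (s, len(b), nbytes))
        out[i] = b
    return out

a = encode_checked(["abc", "déf"], 10)
print(a)
```

The point is that the user, not the dtype, decides what happens when the encoded bytes don't fit.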

[it would be nice if numpy had a pure numpy solution to encoding/decoding,
though maybe it wouldn't really be any faster than going through python
anyway...]

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov

