[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Thu Apr 20 13:28:14 EDT 2017


On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted at gmail.com>
wrote:

> Is there any reason not to support all Unicode encodings that python does,
> with the same names and semantics? This would surely be the simplest to
> understand.
>

I think it should support all fixed-length encodings, but not the
non-fixed-length ones -- they just don't fit well into the numpy data model.
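To make "fits the data model" concrete -- with a fixed-width dtype, element i
always lives at byte offset i * itemsize, so numpy can index without scanning.
A quick sketch using the current 'U' dtype (fixed-width UCS-4):

    >>> import numpy as np
    >>> # every element occupies exactly itemsize bytes, so the offset
    >>> # of element i is just i * itemsize -- no scanning required
    >>> a = np.array(['abcd', 'efgh'], dtype='U4')
    >>> a.dtype.itemsize   # 4 chars * 4 bytes/char (UCS-4)
    16

A variable-length encoding breaks exactly that invariant.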


> Also, if latin1 is going to be the only practical 8-bit encoding, maybe
> check with some non-Western users to make sure it's not going to wreck
> their lives? I'd have selected ASCII as an encoding to treat specially, if
> any, because Unicode already does that and the consequences are familiar.
> (I'm used to writing and reading French without accents because it's passed
> through ASCII, for example.)
>

latin-1 (or latin-9) only makes things better than ASCII -- it buys most of
the accented characters for European languages and some symbols that are
nice to have (I use the degree symbol a lot...). And it is ASCII-compatible
-- so there is NO reason to choose ASCII over Latin-*.

That does no good for non-Latin languages, though -- so we need to hear from
the community: is there substantial demand for a non-Latin
one-byte-per-character encoding?
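To spell out why latin-1 strictly dominates ASCII (a quick sketch at the
python prompt): every code point 0-255 is exactly one byte, and the first 128
are byte-for-byte identical to ASCII:

    >>> # one byte per character, and plain-ascii text is unchanged
    >>> '°déjà vu'.encode('latin-1')
    b'\xb0d\xe9j\xe0 vu'
    >>> len('°déjà vu')
    8
    >>> len('°déjà vu'.encode('latin-1'))
    8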


> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type.
>

we could do that, yes, but an improperly truncated "string" becomes invalid
-- it just seems like a recipe for bugs that won't be found in testing.
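Here's the failure mode in miniature -- truncating utf-8 at a fixed byte
boundary can split a multi-byte character and leave bytes that no longer
decode at all:

    >>> b = 'café'.encode('utf-8')   # 4 characters, 5 bytes
    >>> b[:4].decode('utf-8')        # truncate to a 4-byte field
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in
    position 3: unexpected end of data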

memory is cheap, compression is fast -- we really shouldn't get hung up on
this!

Note: if you are storing a LOT of text (though I have no idea why you would
use numpy for that anyway), then the memory size might matter, but then
semi-arbitrary truncation would probably matter, too.

I expect most text storage in numpy arrays is things like names of
datasets, ids, etc., etc. -- not massive amounts of text -- so storage space
really isn't critical. But having an id or something unexpectedly truncated
could be bad.
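For anyone who hasn't hit it: numpy's fixed-width dtypes already truncate
silently on assignment, which is exactly how an id gets quietly mangled
(the id below is made up, of course):

    >>> import numpy as np
    >>> a = np.empty(1, dtype='U10')
    >>> a[0] = 'station_id_42_revised'   # hypothetical id, too long
    >>> a[0]                             # silently truncated
    'station_id'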

I think practical experience has shown us that people do not handle "mostly
fixed length but once in a while not" text well -- see the nightmare of
UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors
are far more likely to be found in tests (why would you use utf-8 if all
your data are in ascii???). But still -- why invite hard-to-test-for errors?
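The UTF-16 trap, concretely: it's two bytes per character right up until a
character outside the BMP needs a four-byte surrogate pair -- which is
exactly the case that never shows up in tests:

    >>> len('hello'.encode('utf-16-le'))      # 2 bytes per character...
    10
    >>> len('\U0001d11e'.encode('utf-16-le')) # ...until it's 4 (G clef)
    4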

Final point -- as Julian suggests, one reason to support utf-8 is for
interoperability with other systems -- but that makes errors more of an
issue: if the data doesn't pass through the numpy truncation machinery,
invalid data could easily end up in a numpy array.

-CHB

> it would allow UTF-8 to be used just the way it usually is - as an
> encoding that's almost 8-bit.
>

Ouch! That perception is the route to way too many errors! It is by no
means almost 8-bit, unless your data are almost all ascii -- in which case,
use latin-1 for pity's sake!

This highlights my point though -- if we support UTF-8, people WILL use it,
and only test it with mostly-ascii text, and not find the bugs that will
crop up later.

> All this said, it seems to me that the important use cases for string
> arrays involve interaction with existing binary formats, so people who have
> to deal with such data should have the final say. (My own closest approach
> to this is the FITS format, which is restricted by the standard to ASCII.)
>

Yup -- not sure we'll get much guidance here though -- netCDF does not solve
this problem well, either.

But if you are pulling, say, a utf-8-encoded string out of a netCDF file,
it's probably better to pull it out as bytes and pass it through the python
decoding/encoding machinery than to paste the bytes straight into a numpy
array and hope that the encoding and truncation are correct.
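A minimal sketch of that workflow (the byte string here is made up --
pretend it came out of a netCDF variable): decode explicitly first, so any
encoding error surfaces at the file boundary instead of becoming silent
corruption in the array:

    >>> import numpy as np
    >>> raw = b'temp_\xc2\xb0C'    # utf-8 bytes as read from the file
    >>> s = raw.decode('utf-8')    # a bad byte raises right here
    >>> np.array([s])              # numpy then stores validated text
    array(['temp_°C'], dtype='<U7')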

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov