[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Fri Apr 21 14:34:26 EDT 2017


I just re-read the "UTF-8 Everywhere" manifesto, and it helped me clarify my thoughts:

1) Most of it is focused on UTF-8 vs. UTF-16, and that is a strong argument
-- UTF-16 is the worst of both worlds.

2) It doesn't really address how to deal with the fixed-size string storage
that numpy needs.

It does bring up Python's current approach to Unicode:

"""
This lead to software design decisions such as Python’s string O(1) code
point access. The truth, however, is that Unicode is inherently more
complicated and there is no universal definition of such thing as *Unicode
character*. We see no particular reason to favor Unicode code points over
Unicode grapheme clusters, code units or perhaps even words in a language
for that.
"""

My thoughts on that -- it's technically correct, but practicality beats
purity, and the character concept is pretty darn useful for at least some
languages (the ones commonly used in the computing world, anyway).

In any case, whether the top-level API is character-focused doesn't really
have a bearing on the internal encoding, which is very much an
implementation detail -- in Python 3, at least.

And Python has made its decision about that.

So what are the numpy use-cases?

I see essentially two:

1) Use with/from Python -- both creating and working with numpy arrays.

In this case, we want something compatible with Python's string type (i.e.
one with full Unicode support), and I think it should be as transparent as
possible. Python's string has made the decision to present a
character-oriented API to users (despite what the manifesto says...).

However, there is a challenge here: numpy requires fixed-number-of-bytes
dtypes. And full Unicode support with a fixed number of bytes matching a
fixed number of characters is only possible with UCS-4 -- hence the current
implementation. And this is actually just fine! I know we all want to be
efficient with data storage, but really -- in the early days of Unicode,
when folks thought 16 bits were enough, doubling the memory usage for
western-language storage was considered fine -- how long, in computer
lifetimes, does it take to double your memory? But now, when memory, disk
space, bandwidth, etc. are all literally orders of magnitude larger, can we
really not handle a factor-of-4 increase in "wasted" space?
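
A quick illustration of that factor of four, using the current dtype (this
is today's numpy, nothing new):

import numpy as np

# the current unicode dtype is UCS-4: four bytes per character, always
a = np.array(["hello"], dtype="U5")
print(a.dtype.itemsize)   # 20 -- 5 characters * 4 bytes each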

Alternatively, Robert's suggestion of having essentially an object array,
where the objects are known to be Python strings, is a pretty nice idea --
it gives the full power of Python strings, and is a perfect one-to-one
match with the Python text data model.
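
You can sketch that today with dtype=object -- minus the "known to be
strings" enforcement that a real dtype would add:

import numpy as np

# each element is an ordinary Python str: full Unicode, any length, but
# stored as a separate object on the heap rather than in the array buffer
s = np.array(["naïve", "日本語"], dtype=object)
print(s[1], len(s[1]))    # 日本語 3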

But as scientific text data is often 1-byte compatible, a one-byte-per-char
dtype is a fine idea, too -- and we pretty much have that already with the
existing string type. That could simply be enhanced by enforcing the
encoding to be latin-9 (or latin-1, if you don't want the Euro symbol).
This would get us what scientists expect from strings, in a way that is
properly compatible with Python's string type. You'd get encoding errors if
you tried to stuff anything else in there, and that's that.
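
(That enforcement is just plain Python encoding behavior -- the one-byte
dtype itself is hypothetical here, but the errors would look like this:)

# latin-1 covers exactly one byte per character; anything outside it
# fails to encode, rather than being silently mangled
print("résumé".encode("latin-1"))    # b'r\xe9sum\xe9'
try:
    "Ω".encode("latin-1")            # Greek letters aren't in latin-1
except UnicodeEncodeError as err:
    print("rejected:", err)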

Yes, it would have to be a "new" dtype for backwards compatibility.

2) Interchange with other systems: passing the raw binary data back and
forth between numpy arrays and other code (written in C, Fortran, etc.) or
binary file formats.

This is a key use-case for numpy -- I think it is key to numpy's enormous
success. But how important is it for text? Certainly every data set I've
ever worked with has had gobs of binary numerical data and a small
smattering of text. So in that case, if, for instance, h5py had to
encode/decode text when transferring between HDF5 files and numpy arrays, I
don't think I'd ever see the performance hit. As for code complexity -- it
would mean more complex code in interface libs, and less complex code in
numpy itself. (Though numpy could provide utilities to make the interface
code easy to write -- see the sketch below.)
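
Something like this, say -- these helper names are made up, purely to show
how small the interface-lib burden would be:

import numpy as np

def encode_for_file(strings, nbytes, encoding="utf-8"):
    # hypothetical utility: pack str objects into fixed-width bytes
    out = np.zeros(len(strings), dtype="S%d" % nbytes)
    for i, s in enumerate(strings):
        raw = s.encode(encoding)
        if len(raw) > nbytes:
            raise ValueError("%r needs %d bytes, field holds %d"
                             % (s, len(raw), nbytes))
        out[i] = raw
    return out

def decode_from_file(raw, encoding="utf-8"):
    # hypothetical utility: unpack fixed-width bytes back into str objects
    return np.array([b.decode(encoding) for b in raw], dtype=object)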

If we do want to support direct binary interchange with other libs, then we
should probably simply go for it, and support any encoding that Python
supports -- as long as you are dealing with multiple encodings, why try to
decide up front which ones to support?
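
(And Python's codec registry already validates and normalizes encoding
names, so a parametrized dtype wouldn't need its own list:)

import codecs

print(codecs.lookup("UTF8").name)      # 'utf-8'
print(codecs.lookup("latin_1").name)   # 'iso8859-1'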

But how do we expose this to numpy users? I still don't like having
non-fixed-width encoding under the hood, but what can you do? Other than
that, having the encoding be a selectable part of the dtype works fine --
and in that case the number of bytes should be the "length" specifier.

This, however, creates a bit of an impedance mismatch with the
"character-focused" approach of the Python string type. And it requires the
user to understand something about the encoding just to know how many bytes
they need -- a utf-8-100 string will hold a different "length" of text than
a utf-16-100 string.
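
To make that mismatch concrete (plain Python, nothing numpy-specific):

def chars_that_fit(text, encoding, nbytes):
    # how many leading characters of `text` fit in `nbytes` bytes?
    n = 0
    while n < len(text) and len(text[:n + 1].encode(encoding)) <= nbytes:
        n += 1
    return n

print(chars_that_fit("a" * 200, "utf-8", 100))      # 100
print(chars_that_fit("π" * 200, "utf-8", 100))      # 50 -- two bytes each
print(chars_that_fit("a" * 200, "utf-16-le", 100))  # 50 -- two bytes each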

So -- I think we should address the use-cases separately: one dtype for
"normal" Python use and simple interoperability with Python strings, and
one for interoperability at the binary level -- plus an easy way to convert
between the two.

For Python use -- a pointer to a Python string would be nice.

Then use a native flexible-encoding dtype for everything else.

Thinking out loud -- another option would be to set the defaults for the
multiple-encoding dtype so that you'd get UCS-4 -- with its full
compatibility with the Python string type -- and make folks put in an
effort to get anything else.

One more note: if a user tries to assign a value to a numpy string array
that doesn't fit, they should get an error:

EncodingError if it can't be encoded into the defined encoding.

ValueError if it is too long -- it should not be silently truncated.
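
Concretely, assignment would run through logic roughly like this (a sketch
-- the dtype attributes are hypothetical, and the first case would surface
as Python's UnicodeEncodeError):

def checked_encode(value, encoding, itemsize):
    # case 1: the encoding can't represent the value -> encoding error
    raw = value.encode(encoding)
    # case 2: it encodes, but doesn't fit -> ValueError, never truncation
    if len(raw) > itemsize:
        raise ValueError("string needs %d bytes; dtype holds %d"
                         % (len(raw), itemsize))
    return raw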

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov