[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 12:01:05 EDT 2017

On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.kern at gmail.com> wrote:

> Chris, you've mashed all of my emails together, some of them are in reply
> to you, some in reply to others. Unfortunately, this dropped a lot of the
> context from each of them, and appears to be creating some
> misunderstandings about what each person is advocating.
>

Sorry about that -- I was trying to keep an already really long thread from
getting even longer....

And I'm not sure it matters who's doing the advocating, but rather *what*
is being advocated -- I hope I didn't screw that up too badly.

Anyway, I think I made the mistake of mingling possible solutions in with
the use-cases, so I'm not sure if there is any consensus on the use cases
-- which I think we really do need to nail down first -- as Robert has made
clear.

So I'll try again -- use-cases only! We'll keep the possible solutions
separate.

Do we need to write up a NEP for this? It seems we are going a bit in
circles, and we really do want to capture the final decision process.

1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do::

  arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or fewer
characters.

and arr[1] will return a native Python3 string object.

This is the use-case for "casual" numpy users -- not the folks writing H5py
and the like, or the ones writing Cython bindings to C++ libs.
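
For reference, a minimal sketch of what current NumPy already does for
use-case 1) under Python 3. It only uses the existing 'U' dtype; the sample
strings are just illustration, and nothing here is a proposed new API.

```python
import numpy as np

# Default string array under Python 3: the fixed-width unicode 'U' dtype.
arr = np.array(("this", "that"))

kind = arr.dtype.kind               # 'U' means fixed-width unicode
n_chars = arr.dtype.itemsize // 4   # 'U' stores 4 bytes per code point
item = arr[1]                       # a str subclass (np.str_)

# Any code point fits, as long as the string has at most 4 characters.
arr[0] = "\u03b1\u03b2\u03b3\u03b4"  # four Greek letters
greek = str(arr[0])

# Longer strings are silently truncated to the fixed width.
arr[1] = "longer"
truncated = str(arr[1])
```

The cost of this behaviour is the 4 bytes per character, which is what
motivates use-case 2).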

2) There should be some way to store mostly ascii-compatible strings in a
one-byte-per-character array -- so as not to waste space on "typical
european-language-oriented data". Note: this should ALSO be compatible with
Python's character-oriented string model. i.e. a Python string with length
N will fit into a dtype of size N::

  arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

Attempting to put in a string that is not compatible with the encoding
would raise an EncodingError.

This is also a use-case primarily for "casual" users -- but ones who are
concerned with the size of the data storage and know that they are using
european text.
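
A rough sketch of how use-case 2) can be emulated today, to make the intended
behaviour concrete. The helpers `to_single_byte` and `get_item` are
hypothetical, Latin-1 is only an assumed choice of single-byte encoding, and
the error Python actually raises is UnicodeEncodeError rather than the
"EncodingError" named above.

```python
import numpy as np

def to_single_byte(strings, encoding="latin-1"):
    """Hypothetical helper: pack str objects one byte per character."""
    # str.encode raises UnicodeEncodeError for characters the encoding lacks.
    encoded = [s.encode(encoding) for s in strings]
    width = max(len(b) for b in encoded)
    return np.array(encoded, dtype=f"S{width}")

def get_item(arr, index, encoding="latin-1"):
    """Hypothetical helper: return a Python str, as the proposed dtype would."""
    return arr[index].decode(encoding)

arr = to_single_byte(("this", "that", "caf\u00e9"))
itemsize = arr.dtype.itemsize   # one byte per character
last = get_item(arr, 2)

try:
    to_single_byte(("\u03b1\u03b2",))   # Greek is not representable in Latin-1
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

With a true single-byte encoding, a string of N characters always needs
exactly N bytes, which is the property asked for above.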

3) dtypes that support storage in particular encodings:

   Python strings would be encoded appropriately when put into the array. A
Python string would be returned when indexing.

   a) There should be a dtype that could store strings in null-terminated
utf-8 binary format -- for interchange with other systems (netcdf, HDF,
others???) at the binary level.

   b) There should be a dtype that could store data in any encoding
supported by Python -- to facilitate bytes-level interchange with other
systems. If we need more than utf-8, then we might as well have the full set.
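
A rough sketch of the round trip in 3), using existing NumPy functions rather
than any new dtype. `np.char.encode` and `np.char.decode` exist today; the
sample strings are only illustration. It also shows why a utf-8 dtype cannot
be sized in characters: a 5-character string can need more than 5 bytes.

```python
import numpy as np

text = np.array(["na\u00efve", "\u03c0"])   # 5 characters and 1 character

# Encode every element; the result is a fixed-width bytes ('S') array.
utf8 = np.char.encode(text, "utf-8")
n_bytes = utf8.dtype.itemsize    # width in bytes, not characters

# The raw buffer is null-padded utf-8, the layout other systems would read.
raw = utf8.tobytes()

# Decoding gives the original strings back.
round_trip = np.char.decode(utf8, "utf-8").tolist()
```

Passing a different codec name to the same two calls covers case b).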

4) A fixed-length bytes dtype -- pretty much what 'S' is now under Python 3
-- settable from a bytes or bytearray object (or other memoryview?),
and returns a bytes object.

You could use astype() to convert between bytes and a specified encoding
with no change in binary representation. This could be used to store any
binary data, including encoded text or anything else. This should map
directly to the Python bytes model -- thus NOT null-terminated.

This is a little different from the 'S' behaviour on py3 -- it appears that
with 'S', if ALL the trailing bytes are null, then they are truncated, but
if there is a null byte in the middle, then it is preserved. I suspect that
this is a legacy from Py2's use of "strings" as both text and binary data.
But in py3, a "bytes" type should be about bytes, and not text, and thus
null-valued bytes are simply another value a byte can hold.
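
The 'S' behaviour described above can be checked with a short sketch against
current NumPy; the byte values are arbitrary examples.

```python
import numpy as np

trailing = np.array([b"ab\x00\x00"], dtype="S4")
embedded = np.array([b"a\x00b"], dtype="S3")

trailing_item = trailing[0]   # trailing nulls are stripped on access
embedded_item = embedded[0]   # an embedded null is kept
raw = trailing.tobytes()      # the underlying buffer still holds all 4 bytes
```

So the nulls are not lost from storage; they are dropped when an element is
converted back to a bytes object, which is where a pure bytes dtype would
differ.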

There are multiple ways to address these use cases -- please try to make
your comments clear about whether you think the use-case is unimportant, or
ill-defined, or if you think a given solution is a poor choice.

To facilitate that, I will put my comments on possible solutions in a
separate note, too.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov