[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 12:52:06 EDT 2017

OK -- onto proposals:

> 1) The default behaviour for numpy arrays of strings is compatible with
> Python3's string model: i.e. fully unicode supporting, and with a character
> oriented interface. i.e. if you do::
>
>   arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or fewer
> characters.
>
> and arr[1] will return a native Python3 string object.
>
> This is the use-case for "casual" numpy users -- not the folks writing
> H5py and the like, or the ones writing Cython bindings to C++ libs.
>
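
For reference, a minimal sketch of how the existing fixed-width unicode dtype behaves for the example in the quoted proposal. The variable names are only illustrative. One detail worth noting: indexing returns a numpy string scalar that subclasses Python's str, rather than a bare str instance, which is close to, but not literally, "a native Python3 string object".

```python
import numpy as np

arr = np.array(("this", "that"))

kind = arr.dtype.kind          # 'U': fixed-width unicode
itemsize = arr.dtype.itemsize  # 4 characters * 4 bytes each (UCS-4) = 16

item = arr[1]                  # numpy string scalar; a subclass of Python's str

# Any unicode text of up to 4 characters fits, not just ASCII.
# (Three CJK characters plus "!", written with escapes.)
arr[0] = "\u65e5\u672c\u8a9e!"
non_ascii = str(arr[0])

# Longer values are silently truncated to the array's width.
arr[1] = "longer"
truncated = str(arr[1])

print(kind, itemsize, isinstance(item, str), truncated)
```

The silent truncation is the price of the fixed width: the width is set when the array is created and applies to every element.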

I see two options here:

a) The current 'U' dtype -- fully meets the specs, and is already there.

b) Having a pointer-to-a-Python-string dtype:

   - I take it that's what Pandas does, and people seem happy.

   - That would get us variable-length strings, and potentially other nifty
     string processing.

   - It would lose the ability to interact at the binary level with other
     systems -- but do any other systems use UCS-4 anyway?

   - How would it work with pickle and numpy zip storage?
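
As a rough approximation of option (b), numpy's existing object dtype already stores pointers to Python objects, so it can stand in for a pointer-to-string dtype in a sketch. This is an assumption made for illustration, not the proposed dtype itself, and "numpy zip storage" is read here as the .npz format written by np.savez.

```python
import io
import pickle

import numpy as np

# Object arrays hold references to ordinary Python str objects,
# so elements can have any length.
obj_arr = np.array(["this", "that", "a much longer string"], dtype=object)
elem_type = type(obj_arr[2])   # plain str

# Round trip through pickle.
restored = pickle.loads(pickle.dumps(obj_arr))

# Round trip through an in-memory .npz archive. Object arrays are
# pickled inside the archive, so loading needs allow_pickle=True.
buf = io.BytesIO()
np.savez(buf, strings=obj_arr)
buf.seek(0)
with np.load(buf, allow_pickle=True) as data:
    from_npz = data["strings"]

print(elem_type, restored.tolist() == obj_arr.tolist(), from_npz.dtype)
```

With the object dtype both round trips work, but only by falling back to pickle, which is also why the binary-level interoperability mentioned above is lost.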

Personally, I'm fine with (a), but (b) seems like it could be a nice
addition. As the 'U' type already exists, the choice to add a python-string
type is really orthogonal to the rest of this discussion.

Note that I think using UTF-8 internally to fit this need is a mistake -- it
does not match well with the Python string model.
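
One way to illustrate the mismatch (an illustration only, and not necessarily the whole argument): Python strings are indexed by character, while UTF-8 uses between one and four bytes per character, so a fixed number of bytes does not correspond to a fixed number of characters. The sample words below are arbitrary.

```python
# Sample strings: ASCII, Latin-1 range, CJK plus ASCII, and an emoji.
words = ["that", "na\u00efve", "\u65e5\u672c\u8a9e!", "\U0001F642"]

# For each word: (characters, UTF-8 bytes, UCS-4 bytes).
sizes = {
    w: (len(w), len(w.encode("utf-8")), len(w.encode("utf-32-le")))
    for w in words
}

for w, (n_chars, n_utf8, n_ucs4) in sizes.items():
    print(repr(w), n_chars, n_utf8, n_ucs4)
```

In UCS-4 the byte count is always four times the character count, so an "N character" field has a fixed size. In UTF-8 the same four-byte field holds "that" but only a single emoji.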

That's it for use-case (1)

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov