[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 20 14:23:11 EDT 2017

On Thu, 20 Apr 2017 10:26:13 -0700
Stephan Hoyer <shoyer at gmail.com> wrote:
> 
> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed-length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed size
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of codepoints, not characters.
> 
> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
> 
> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.

I think you want at least: ASCII, UTF-8, UCS-2 (aka UTF-16 without
surrogates), UTF-32.  That is, three common fixed-width encodings and
one variable-width encoding.
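
To make the size trade-off concrete, here is a rough sketch in plain
Python (standard library only, not a NumPy API) of how many bytes a
single code point takes in these encodings. It also mimics null-padding
into a fixed-size element. The names encoded_width, fits_ucs2 and
pad_fixed, and the 8-byte item size, are made up for illustration.
Python has no dedicated UCS-2 codec, so UCS-2 is approximated here as
UTF-16 restricted to code points that need no surrogate pair.

```python
# Illustrative sketch only: these helpers are hypothetical, not NumPy functions.

def encoded_width(text, encoding):
    """Return the number of bytes `text` occupies once encoded."""
    return len(text.encode(encoding))

def fits_ucs2(text):
    """True if every code point is in the BMP, so UTF-16 needs no surrogate pairs."""
    return all(ord(ch) <= 0xFFFF for ch in text)

def pad_fixed(text, encoding, itemsize):
    """Encode `text` and null-pad it to exactly `itemsize` bytes."""
    raw = text.encode(encoding)
    if len(raw) > itemsize:
        raise ValueError("encoded text does not fit in itemsize bytes")
    return raw.ljust(itemsize, b"\x00")

# One ASCII letter, one Latin-1 letter, the euro sign, one non-BMP emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    widths = {enc: encoded_width(ch, enc)
              for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(ascii(ch), widths, "fits UCS-2:", fits_ucs2(ch))

# A fixed 8-byte UTF-8 element: 3 bytes of text, 5 bytes of null padding.
element = pad_fixed("n\u00e9", "utf-8", 8)
print(element, len(element))
```

UTF-32 is always 4 bytes per code point, UTF-16 is 2 bytes within the
BMP and 4 outside it, and UTF-8 ranges from 1 to 4 bytes, so a
fixed-byte-size UTF-8 element holds a varying number of code points.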

Regards

Antoine.