[Numpy-discussion] String & unicode arrays vs text loading in python 3

Tue Sep 13 10:21:57 EDT 2016

Sebastian Berg writes:

> On Tue, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm taking a shot at issue #3184 [1], based on the observation that the
>> string dtype ('S') under python 3 uses byte arrays instead of unicode (the
>> only readable string type in python 3).
>> 
>> This brings two major problems:
>> 
>> * numpy code has to jump through hoops to open and read files as binary
>>   data to load text into a bytes array, and does not play well with users
>>   providing string (unicode) arguments
>> 
>> * the repr of these arrays shows strings as b'text' instead of 'text',
>>   which breaks doctests of software built on numpy
>> 
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING
>> and NPY_UNICODE).
>> 
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
>> implementation) will provide the best backwards compatibility, but is more
>> cumbersome to implement.

> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.
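
For reference, the storage difference described above can be checked directly.
This is a minimal sketch, assuming a Python 3 interpreter with numpy installed;
the array contents are purely illustrative:

```python
import numpy as np

# 'S4': fixed-width byte strings, one byte per character.
s = np.array([b"text"], dtype="S4")

# 'U4': fixed-width unicode strings, four bytes (UCS-4) per character.
u = np.array(["text"], dtype="U4")

print(s.dtype.itemsize)  # 4
print(u.dtype.itemsize)  # 16
```

Both arrays hold one four-character element, yet the per-element size differs
by a factor of four, so the two dtypes cannot share a memory layout as they
stand.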

BTW, by identical I mean having two externally visible types, but a common
implementation in python 3 (that of NPY_UNICODE).

The equally sane but not backwards-compatible option (I'm asking if this is
acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
implementation, and to make 'U' (and np.unicode_) an alias for 'S' (and
np.string_).
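
To make the current behaviour concrete, here is a small sketch of what a user
sees today under python 3, assuming numpy is installed. It only illustrates the
existing 'S' vs 'U' split and is not a proposed implementation:

```python
import numpy as np

s = np.array([b"text"])  # bytes input is stored with the 'S' dtype
u = np.array(["text"])   # str input is stored with the 'U' dtype

# The two dtypes are distinct kinds with distinct element types.
print(s.dtype.kind, u.dtype.kind)
print(isinstance(s[0], bytes), isinstance(u[0], str))

# The repr of the 'S' array carries the b'' prefix that breaks doctests.
print(repr(s))
print(repr(u))

# Today an explicit decode step is needed to get readable strings.
decoded = np.char.decode(s, "ascii")
print(decoded.dtype.kind)
```

With 'U' as an alias for 'S', the explicit decode step and the b'' prefix in
the repr would both go away, at the cost of 'S' no longer meaning raw bytes.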

Cheers,
  Lluis