[Numpy-discussion] Bytes vs. Unicode in Python3

Fri Nov 27 15:09:10 EST 2009

2009/11/27 Christopher Barker <Chris.Barker at noaa.gov>:
>
>> The point is that I don't think we can just decide to use Unicode or
>> Bytes in all places where PyString was used earlier.
>
> Agreed.

I only half agree. It seems to me that for almost all situations where
PyString was used, the right data type is a python3 string (which is
unicode). I realize there may be a few cases where it is
appropriate to use bytes, but I think there needs to be a compelling
reason for each one.

> In a way, unicode strings are a bit like arrays: they have an encoding
> associated with them (like a dtype in numpy). You can represent a given
> bit of text in multiple different arrangements of bytes, but they are all
> supposed to mean the same thing and, if you know the encoding, you can
> convert between them. This is kind of like how one can represent 5 in
> any of many dtypes: uint8, int16, int32, float32, float64, etc. Not any
> value represented by one dtype can be converted to all other dtypes, but
> many can. Just like encodings.

This is incorrect. Unicode objects do not have default encodings or
multiple internal representations (within a single python interpreter,
at least). Unicode objects use a 2- or 4-byte representation
internally, but this is almost invisible to the user. Encodings only
become relevant when you want to convert a unicode object to a byte
stream. It is usually an error to store text in a byte stream (for it
to make sense you must provide some mechanism to specify the
encoding).
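
To make the distinction concrete, here is a minimal Python 3 sketch. The
sample text and variable names are only illustrative assumptions; the
point is that one string object can be encoded to several different byte
sequences, and the bytes alone do not say which encoding produced them.

```python
# One text object; its length is counted in code points, not bytes.
text = "na\u00efve"  # "naive" with a diaeresis on the i
assert len(text) == 5

# Encodings only enter the picture when converting to bytes.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
as_utf16 = text.encode("utf-16-le")

# Three different byte sequences for the same text.
assert len({as_utf8, as_latin1, as_utf16}) == 3

# Decoding with the matching encoding recovers the text...
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("latin-1") == text

# ...but decoding with the wrong one silently yields different text.
garbled = as_utf8.decode("latin-1")
assert garbled != text
```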

> Anyway, all this brings me to think about the use of strings in numpy in
> this way: if it is meant to be a human-readable piece of text, it should
> be a unicode object. If not, then it is bytes.
>
> So: "fromstring" and the like should, of course, work with bytes (though
> maybe buffers really...)

I think if you're going to call it fromstring, it should convert from
strings (i.e. unicode strings). But really, I think it makes more
sense to rename it frombytes() and have it convert bytes objects. One
could then have

    def fromstring(s, encoding="utf-8"):
        return frombytes(s.encode(encoding))

as a shortcut. Maybe ASCII makes more sense as a default encoding. But
really, think about where the user's going to get the string: most of
the time it's coming from a disk file or a network stream, so it will
be a byte string already, so they should use frombytes.
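
A runnable sketch of that split, under stated assumptions: this is not
numpy code. The standard library's array module stands in for an ndarray
so the example has no dependencies, the typecode parameter is
hypothetical, and ASCII is used as the default encoding as suggested
above.

```python
import array


def frombytes(data, typecode="B"):
    """Hypothetical helper: interpret raw bytes as an array of numbers.

    array.array is only a stand-in for a numpy array here.
    """
    out = array.array(typecode)
    out.frombytes(data)
    return out


def fromstring(s, encoding="ascii", typecode="B"):
    """Shortcut for text input: encode explicitly, then reuse frombytes."""
    return frombytes(s.encode(encoding), typecode)


# Data read from a file or socket is already bytes: use frombytes.
raw = b"\x01\x02\x03"
values = frombytes(raw)

# Text has to name an encoding before it can be treated as bytes.
from_text = fromstring("abc")

# Text that does not fit the encoding fails loudly instead of guessing.
try:
    fromstring("\u00e9")
    non_ascii_rejected = False
except UnicodeEncodeError:
    non_ascii_rejected = True
```

The explicit encoding argument is the design point: the conversion from
text to bytes never happens without the caller (or the default) saying
how.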

>> To summarize the use cases I've run across so far:
>>
>> 1) For 'S' dtype, I believe we use Bytes for the raw data and the
>>    interface.
>
> I don't think so here. 'S' is usually used to store human-readable
> strings, I'd certainly expect to be able to do:
>
> s_array = np.array(['this', 'that'], dtype='S10')
>
> And I'd expect it to work with non-literals that were unicode strings,
> i.e. human readable text. In fact, it's pretty rare that I'd ever want
> bytes here. So I'd see 'S' mapped to 'U' here.

+1

> Francesc Alted wrote:
>> the following should still work:
>>
>> In [2]: s = np.array(['asa'], dtype="S10")
>>
>> In [3]: s[0]
>> Out[3]: 'asa'  # will become b'asa' in Python 3
>
> I don't like that -- I put in a string, and get a bytes object back?

I agree.

>> In [4]: s.dtype.itemsize
>> Out[4]: 10     # still 1-byte per element
>
> But what if the strings passed in aren't representable in one byte
> per character? Do we define "S" as only supporting ANSI-only string?
> what encoding?

Itemsize will change. That's fine.

>> 3) Format strings
>>
>>       a = array([], dtype=b'i4')
>>
>> I don't think it makes sense to handle format strings in Unicode
>> internally -- they should always be coerced to bytes.
>
> This should be fine -- we control what is a valid format string, and
> thus they can always be ASCII-safe.

I have to disagree. Why should we force the user to use bytes? The
format strings are just that, strings, and we should be able to supply
python strings to them. Keep in mind that "coercing" strings to bytes
requires extra information, namely the encoding. If you want to
emulate python2's value-dependent coercion - raise an exception only
if non-ASCII is present - keep in mind that python3 is specifically
removing that behaviour because of the problems it caused.
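
For illustration, a small sketch of what explicit coercion could look
like if format strings are accepted as python strings and only turned
into bytes internally. The function name format_to_bytes is
hypothetical and not part of numpy; it only shows that the encoding has
to be named somewhere.

```python
def format_to_bytes(fmt):
    """Hypothetical helper: accept str or bytes, return ASCII bytes."""
    if isinstance(fmt, str):
        # The encoding is stated explicitly; non-ASCII text raises
        # UnicodeEncodeError rather than being coerced silently.
        return fmt.encode("ascii")
    return bytes(fmt)


# Users can pass ordinary strings; bytes still work.
from_str = format_to_bytes("i4")
from_bytes = format_to_bytes(b"i4")

# Python 3 itself never coerces between the two types implicitly.
try:
    "i4" + b"i4"
    implicit_mix_allowed = True
except TypeError:
    implicit_mix_allowed = False
```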

Anne