[Numpy-discussion] Bytes vs. Unicode in Python3

Fri Nov 27 05:27:00 EST 2009

Fri, 2009-11-27 at 11:17 +0100, Francesc Alted wrote:
> On Friday 27 November 2009 10:47:53 Pauli Virtanen wrote:
> > 1) For 'S' dtype, I believe we use Bytes for the raw data and the
> >    interface.
> > 
> >    Maybe we want to introduce a separate "bytes" dtype that's an alias
> >    for 'S'?
> 
> Yeah.  As regular strings in Python 3 are Unicode, I think that introducing 
> a separate "bytes" dtype would help with the transition.  Meanwhile, the 
> following should still work:
> 
> In [2]: s = np.array(['asa'], dtype="S10")
> 
> In [3]: s[0]
> Out[3]: 'asa'  # will become b'asa' in Python 3
> 
> In [4]: s.dtype.itemsize
> Out[4]: 10     # still 1 byte per character

Yes. But now I wonder, should

	array(['foo'], str)
	array(['foo'])

be of dtype 'S' or 'U' in Python 3? I'm leaning towards 'U', which
will break existing code -- there's probably no avoiding it.
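
The two spellings can be inspected with a short sketch like the one
below. It assumes a Python 3 interpreter and a NumPy release that
supports it, where str input comes out as 'U'; the explicit 'S' dtype
and bytes input are included for comparison.

```python
import numpy as np

# Python 3 str input: both spellings are inspected via dtype.kind
a = np.array(['foo'], str)
b = np.array(['foo'])

# Byte strings, either from bytes literals or an explicit 'S' dtype
c = np.array([b'foo'])
d = np.array(['foo'], dtype='S')

print(a.dtype.kind, b.dtype.kind)  # kind of the str-based arrays
print(c.dtype.kind, d.dtype.kind)  # kind of the bytes-based arrays
```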

[clip]
> Also, I suppose that there will be issues with the current Unicode support in 
> NumPy:
> 
> In [5]: u = np.array(['asa'], dtype="U10")
> 
> In [6]: u[0]
> Out[6]: u'asa'  # will become 'asa' in Python 3
> 
> In [7]: u.dtype.itemsize
> Out[7]: 40      # not sure about the size in Python 3

I suspect the Unicode stuff will keep working without major changes,
except maybe dropping the u in repr. It is difficult to believe the
CPython guys would have significantly changed the current Unicode
implementation, if they didn't bother changing the names of the
functions :)
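
As a quick check of the item sizes quoted above, here is a minimal
sketch, again assuming Python 3 with a NumPy release that supports it.
NumPy's 'U' dtype stores 4 bytes per character, so 'U10' occupies 40
bytes per element, while 'S10' occupies 10.

```python
import numpy as np

u = np.array(['asa'], dtype='U10')
s = np.array(['asa'], dtype='S10')

# Bytes per array element: 4 per character for 'U', 1 per character for 'S'
print(u.dtype.itemsize, s.dtype.itemsize)

# Under Python 3 the scalars are subclasses of str and bytes respectively
print(isinstance(u[0], str), isinstance(s[0], bytes))
```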

> For example, if it is true that internal strings in Python 3 are Unicode UTF-8 
> (as René seems to suggest), I suppose that the internal conversions from 
> 2-byte or 4-byte characters (depending on how the Python interpreter has been 
> compiled) in the NumPy Unicode dtype to the new Python string would have to 
> be reworked (perhaps you have dealt with that already).

I don't think they are internally UTF-8:
http://docs.python.org/3.1/c-api/unicode.html

"""Python’s default builds use a 16-bit type for Py_UNICODE and store
Unicode values internally as UCS2."""
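
The build width can also be checked from Python itself through
sys.maxunicode. The sketch below uses only the standard library; the
encoded lengths are just an illustration of how much space the same
text takes in each encoding, and say nothing about what the
interpreter uses internally.

```python
import sys

# 0xFFFF on a narrow (UCS2) build, 0x10FFFF on a wide (UCS4) build
print(hex(sys.maxunicode))

text = 'asa'
# Encoded size in bytes for UTF-8, UTF-16 and UTF-32 (no byte-order mark)
sizes = (
    len(text.encode('utf-8')),
    len(text.encode('utf-16-le')),
    len(text.encode('utf-32-le')),
)
print(sizes)
```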

-- 
Pauli Virtanen