[Numpy-discussion] Bytes vs. Unicode in Python3
Pauli Virtanen
pav at iki.fi
Fri Nov 27 05:27:00 EST 2009
Fri, 2009-11-27 at 11:17 +0100, Francesc Alted wrote:
> On Friday 27 November 2009 10:47:53 Pauli Virtanen wrote:
> > 1) For 'S' dtype, I believe we use Bytes for the raw data and the
> > interface.
> >
> > Maybe we want to introduce a separate "bytes" dtype that's an alias
> > for 'S'?
>
> Yeah. As regular strings in Python 3 are Unicode, I think that introducing
> a separate "bytes" dtype would help with the transition. Meanwhile, the
> following should still work:
>
> In [2]: s = np.array(['asa'], dtype="S10")
>
> In [3]: s[0]
> Out[3]: 'asa' # will become b'asa' in Python 3
>
> In [4]: s.dtype.itemsize
> Out[4]: 10 # still 1-byte per element
Yes. But now I wonder, should
array(['foo'], str)
array(['foo'])
be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U',
which will break existing code -- there's probably no avoiding it.
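
For reference, a minimal sketch of how to check what a given NumPy
installation actually does with these calls. The helper name describe() is
made up for illustration. The expected values in the comments assume a NumPy
release running on Python 3, where str input maps to 'U' and bytes input maps
to 'S'; other setups may differ.

```python
import numpy as np

def describe(arr):
    # Hypothetical helper: (dtype kind, bytes per item, Python type of item 0).
    return arr.dtype.kind, arr.dtype.itemsize, type(arr[0].item())

default = describe(np.array(['foo']))                # str input, no dtype given
via_str = describe(np.array(['foo'], str))           # dtype given as builtin str
from_bytes = describe(np.array([b'foo']))            # bytes input, no dtype given
forced_s = describe(np.array(['foo'], dtype='S10'))  # ASCII str forced into 'S'

for name, info in [('default', default), ('via_str', via_str),
                   ('from_bytes', from_bytes), ('forced_s', forced_s)]:
    print(name, info)
```

Code that relied on array(['foo']) giving an 'S' array would have to pass
bytes or an explicit dtype='S...' under these assumptions.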
[clip]
> Also, I suppose that there will be issues with the current Unicode support in
> NumPy:
>
> In [5]: u = np.array(['asa'], dtype="U10")
>
> In [6]: u[0]
> Out[6]: u'asa' # will become 'asa' in Python 3
>
> In [7]: u.dtype.itemsize
> Out[7]: 40 # not sure about the size in Python 3
I suspect the Unicode stuff will keep working without major changes,
except maybe dropping the u in repr. It is difficult to believe the
CPython guys would have significantly changed the current Unicode
implementation, if they didn't bother changing the names of the
functions :)
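
A small sketch of how the itemsize question can be checked directly. It
assumes that NumPy stores 'U' data as fixed-width 4-byte code points
regardless of how the interpreter was built, and it forces little-endian
byte order ('<U10') so the comparison with UTF-32-LE is well defined.

```python
import numpy as np

u = np.array(['asa'], dtype='<U10')   # 10 code points per item, little-endian
itemsize = u.dtype.itemsize           # bytes per item (4 per code point)
first = u[0]                          # NumPy scalar, a subclass of str on Python 3
raw = u.tobytes()                     # the fixed-width, zero-padded buffer
utf32 = 'asa'.encode('utf-32-le')     # same code points as 4-byte units

print(itemsize, str(first), raw[:12] == utf32)
```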
> For example, if it is true that internal strings in Python 3 are Unicode UTF-8
> (as René seems to suggest), I suppose that the internal conversions from the
> 2-byte or 4-byte representation (depending on how the Python interpreter has
> been compiled) in the NumPy Unicode dtype to the new Python string will have
> to be reworked (perhaps you have dealt with that already).
I don't think they are internally UTF-8:
http://docs.python.org/3.1/c-api/unicode.html
"""Python’s default builds use a 16-bit type for Py_UNICODE and store
Unicode values internally as UCS2."""
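
Which kind of build is running can be checked from Python itself via
sys.maxunicode. A rough sketch follows; the function name unicode_build() is
made up for illustration. Note the assumption spelled out in the comments:
Python 3.3 and later always report 0x10FFFF, so the check only distinguishes
narrow from wide builds on older interpreters.

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    # Hypothetical helper: classify the interpreter by its largest code point.
    if maxunicode == 0xFFFF:
        return 'narrow (UCS2, 16-bit Py_UNICODE)'
    if maxunicode == 0x10FFFF:
        # Also what every Python >= 3.3 reports, whatever its internal layout.
        return 'wide (UCS4) or Python >= 3.3'
    raise ValueError('unexpected sys.maxunicode: %r' % (maxunicode,))

print(unicode_build())
```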
--
Pauli Virtanen