[Numpy-discussion] String & unicode arrays vs text loading in python 3

Tue Sep 13 12:55:38 EDT 2016

We had a big long discussion about this on this list a while back (maybe 2
years ago?); please search the archives to find it. I'm pretty sure, though,
that we never did come to a conclusion. I think it started with wanting
better support for unicode in loadtxt and the like, and ended up delving
into other encodings for the 'U' dtype, and maybe a single-byte string
dtype (latin-1), or maybe a variable-size unicode object like Py3's, or...

However, it is absolutely a non-starter to change the binary representation
of the 'S' type in any version of numpy. Due to the legacy of py2 (and,
indeed, most computing environments) 'S' is a single byte string
representation. And the binary representation is often really key to numpy
use.
Period, end of story.

And that maps to a py2 string and py3 bytes object.
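
To make the binary-layout point concrete, here is a minimal sketch (the
array contents are made up for illustration, and it assumes python 3 with
numpy installed). It shows that 'S' stores one byte per character and hands
back bytes, while 'U' stores 4 bytes per character and hands back str:

```python
import numpy as np

# Hypothetical sample data, chosen only to show the memory layout.
s = np.array([b"abc", b"de"], dtype="S3")   # fixed-width byte strings
u = np.array(["abc", "de"], dtype="U3")     # fixed-width unicode (UCS-4)

s_itemsize = s.dtype.itemsize   # 3 bytes per element: 1 byte per character
u_itemsize = u.dtype.itemsize   # 12 bytes per element: 4 bytes per character

# The raw buffer of the 'S' array is just the bytes, NUL-padded to width 3.
s_buffer = s.tobytes()

# Elements come back as bytes for 'S' and as str for 'U' under python 3.
s_item = s[0].item()
u_item = u[0].item()

print(s_itemsize, u_itemsize, s_buffer, type(s_item), type(u_item))
```

Anything that writes those buffers to disk or passes them to C/Fortran code
depends on that layout, which is why I don't see how 'S' could change.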

py2 does, of course, have a Unicode object as well. If you want your code
(and doctests, and ...) to be compatible, then you should probably go to
Unicode strings everywhere. py3 now supports the u'string' no-op literal to
make this easier.

(though I guess the __repr__ won't tack on that 'u', which is going to be a
problem for docstrings).
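
A small sketch of what I mean, assuming python 3.3 or later (the variable
names are just for illustration):

```python
# The u'' prefix is accepted in python 3 but has no effect there.
text = u"text"
raw = b"text"

text_repr = repr(text)   # no 'u' prefix in the repr under python 3
raw_repr = repr(raw)     # bytes always carry the 'b' prefix under python 3

print(text_repr)
print(raw_repr)
```

Under py2 the first repr would carry the 'u' prefix, so a doctest that
prints the repr can't match on both versions without some extra help.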

Note also that py3 has added more and more "string-like" support to the
bytes object, so it's not too bad to go bytes-only.
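
For example, a rough sketch of staying in bytes the whole way (the record
contents are invented, and the %-formatting line assumes python 3.5 or
later):

```python
# A made-up "key=value" record, as it might come out of a binary file read.
record = b"temp=21.5"

# bytes supports the usual str-style methods, with bytes arguments.
key, _, value = record.partition(b"=")
upper_key = key.upper()
starts_ok = record.startswith(b"temp")

# %-formatting works on bytes from python 3.5 onward.
rebuilt = b"%s=%s" % (upper_key, value)

# Decode only at the point where a real number (or text) is needed.
number = float(value.decode("ascii"))

print(upper_key, starts_ok, rebuilt, number)
```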

-CHB

On Tue, Sep 13, 2016 at 7:21 AM, Lluís Vilanova <vilanova at ac.upc.edu> wrote:

> Sebastian Berg writes:
>
> > On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
> >> Hi! I'm giving a shot to issue #3184 [1], based on the observation
> >> that the string dtype ('S') under python 3 uses byte arrays instead
> >> of unicode (the only readable string type in python 3).
> >>
> >> This brings two major problems:
> >>
> >> * numpy code has to go through loops to open and read files as binary
> >>   data to load text into a bytes array, and does not play well with
> >>   users providing string (unicode) arguments
> >>
> >> * the repr of these arrays shows strings as b'text' instead of
> >>   'text', which breaks doctests of software built on numpy
> >>
> >> What I'm trying to do is make dtypes 'S' and 'U' equivalent
> >> (NPY_STRING and NPY_UNICODE).
> >>
> >> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
> >> internal implementation) will provide the best backwards
> >> compatibility, but is more cumbersome to implement.
>
> > I am not sure how that can be possible. Those types are fundamentally
> > different in how they store their data. String types use one byte per
> > character, unicode types will use 4 bytes per character. You can maybe
> > default to unicode in more cases in python 3, but you cannot make them
> > identical internally.
>
> BTW, by identical I mean having two externally visible types, but a common
> implementation in python 3 (that of NPY_UNICODE).
>
> The as-sane but not backwards-compatible option (I'm asking if this is
> acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
> implementation, and making 'U' (and np.unicode_) an alias for 'S' (and
> np.string_).
>
>
> Cheers,
>   Lluis

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov