[Numpy-discussion] String type again.

Chris Barker chris.barker at noaa.gov
Mon Jul 14 16:13:00 EDT 2014


On Sat, Jul 12, 2014 at 10:17 AM, Charles R Harris
<charlesr.harris at gmail.com> wrote:

> As previous posts have pointed out, Numpy's `S` type is currently treated
> as a byte string, which leads to more complicated code in python3.
>

Also, a byte string in py3 is not, in fact, the same as the py2 string type.
So we have a problem -- if we want 'S' to mean what it essentially means in
py2, what do we map it to in pure-python land?
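
For instance, something like this on py3 (a minimal illustration; the exact
reprs may vary a bit by numpy version):

    import numpy as np

    a = np.array(["hello"], dtype="S5")
    a[0]             # b'hello' -- you get bytes back, not str
    a[0] == "hello"  # False: bytes never compare equal to str in py3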

I propose we embrace the py3 model as fully as possible:

There is text data, and there is binary data. In py3, that is 'str' and
'bytes'.

So numpy should have dtypes to match these. We're a bit stuck, however,
because 'S' mapped to the py2 string type, which no longer exists in py3.
Sorry, I'm not running py3 so I can't check what 'S' does now, but I know
it's a bit broken, and it may be too late to change it.

But it is certainly a common case in the scientific world to have
1-byte-per-character string data, and to care about storage size. So a
1-byte-per-character text dtype may be a good idea.

As for a bytes type -- do we need it, or are we fine with simply using
uint8 arrays? (Or, in what is probably the most common case, converting
directly to the type that is actually stored in those bytes?)
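
Something like this, I mean (a sketch -- the little-endian uint16 payload
here is made up purely for illustration):

    import numpy as np

    raw = b"\x01\x00\x02\x00\x03\x00"   # hypothetical little-endian uint16 data
    np.frombuffer(raw, dtype=np.uint8)  # the raw bytes, as a uint8 array
    np.frombuffer(raw, dtype="<u2")     # reinterpreted as the stored type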


> especially for ascii strings. This note proposes to adapt the existing
> 'a' type letter, currently aliased to 'S', as a new fixed-encoding dtype.
>

+1


> Python 3.3 introduced two one-byte internal representations for unicode
> strings, ascii and latin1. Ascii has the advantage that it is a subset of
> UTF-8, whereas latin1 has a few more symbols.
>

+1 for latin-1 -- those extra symbols are handy. Also, at least with
Python's stdlib codec, you can round-trip any binary data through
latin-1 -- kind of making it act like a bytes object...
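
To see the round-trip (this is just the stdlib, no numpy needed -- every
one of the 256 byte values decodes and re-encodes unchanged):

    raw = bytes(range(256))               # every possible byte value
    text = raw.decode("latin-1")          # always succeeds, one char per byte
    assert text.encode("latin-1") == raw  # round-trips exactly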


> Another possibility is to just make it a UTF-8 encoding, but I think this
> would involve more overhead, as Python would need to determine the maximum
> character size.
>

Yeah -- that is (a) overhead, and (b) it breaks the numpy fixed-size dtype
model. And it's trickier for numpy arrays, because they are mutable --
python strings can get away with it, as they never need to accommodate a
string changing size in place.
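
You can see the problem with a quick check -- three characters each, three
different encoded sizes:

    for s in ["abc", "añc", "a€c"]:
        print(len(s), len(s.encode("utf-8")))   # -> 3 3, then 3 4, then 3 5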

On Sat, Jul 12, 2014 at 5:02 PM, Nathaniel Smith <njs at pobox.com> wrote:

> I feel like for most purposes, what we *really* want is a variable length
> string dtype (I.e., where each element can be a different length.).


Well, that is fundamentally different from the usual numpy data model -- it
would require that the array store pointers and dereference them on use.
Is there anywhere else in numpy (other than the object dtype) that does
that?

And if we did -- would it end up having any advantage over putting strings
in an object array? Or for that matter, using a list of strings instead?
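
For reference, the object-array version we have today -- each element is an
ordinary python str, resized freely:

    import numpy as np

    a = np.array(["short", "a much longer string"], dtype=object)
    a[0] = "any length you like"   # no fixed-width truncation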



> Pandas pays quite a price in overhead to fake this right now. Adding
> such a thing will cause some problems regarding compatibility (what to do
> with array(["foo"])) and education, but I think it's worth it in the long
> run.


I.e., do you use the fixed-length type or the variable-length type? I'm not
sure it's a killer to have a default and let the user set a dtype if they
want something else.
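
For reference, what array(["foo"]) gives you today on py3 -- fixed-width
unicode, sized to the longest element (byte order shown is for a
little-endian box):

    import numpy as np

    np.array(["foo"]).dtype            # dtype('<U3')
    np.array(["foo", "barbaz"]).dtype  # dtype('<U6')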

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov