[Numpy-discussion] String type again.
Chris Barker
chris.barker at noaa.gov
Mon Jul 14 16:13:00 EDT 2014
On Sat, Jul 12, 2014 at 10:17 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:
> As previous posts have pointed out, Numpy's `S` type is currently treated
> as a byte string, which leads to more complicated code in python3.
>
Also, a byte string in py3 is not, in fact, the same as the py2 string type.
So we have a problem: if we want 'S' to mean what it essentially does in
py2, what do we map it to in pure-python land?
I propose we embrace the py3 model as fully as possible:
There is text data, and there is binary data. In py3, that is 'str' and
'bytes'.
So numpy should have dtypes to match these. We're a bit stuck, however,
because 'S' mapped to the py2 string type, which no longer exists in py3.
(Sorry, I'm not running py3 at the moment to check what 'S' does now, but I
know it's a bit broken, and it may be too late to change it.)
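For reference, here's a quick sketch of what 'S' does under py3 as far as I
can tell -- elements come back as bytes, and decoding is left to the user:

```python
import numpy as np

# The 'S' dtype stores fixed-width byte strings. Under Python 3,
# indexing yields bytes objects rather than str.
a = np.array(["hello", "world"], dtype="S5")
print(a[0])                   # b'hello' -- bytes, not text
print(a[0].decode("ascii"))   # 'hello' -- the user must decode explicitly
```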
But: it is certainly a common case in the scientific world to have
1-byte-per-character string data, and to care about storage size. So a
1-byte-per-character text dtype may be a good idea.
As for a bytes type -- do we need it, or are we fine with simply using
uint8 arrays? (Or, in the most common case, converting directly to the
type that is actually stored in those bytes?)
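To illustrate that last point -- a sketch, with made-up data -- you can
already view raw bytes as uint8, or go straight to the type the bytes
actually encode, without any dedicated bytes dtype:

```python
import numpy as np

# Hypothetical raw binary data: three little-endian int16 values.
raw = b"\x01\x00\x02\x00\x03\x00"

as_bytes = np.frombuffer(raw, dtype=np.uint8)   # one element per byte
as_int16 = np.frombuffer(raw, dtype="<i2")      # the type actually stored

print(as_bytes)   # [1 0 2 0 3 0]
print(as_int16)   # [1 2 3]
```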
> especially for ascii strings. This note proposes to adapt the currently
> existing 'a' type letter, currently aliased to 'S', as a new fixed encoding
> dtype.
>
+1
> Python 3.3 introduced two one byte internal representations for unicode
> strings, ascii and latin1. Ascii has the advantage that it is a subset of
> UTF-8, whereas latin1 has a few more symbols.
>
+1 for latin-1 -- those extra symbols are handy. Also, at least with
Python's stdlib codecs, you can round-trip any binary data through
latin-1 -- kind of making it act like a bytes object...
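That round-trip property is easy to demonstrate: every byte value 0-255
maps to exactly one latin-1 code point, whereas ascii rejects anything
above 0x7f:

```python
# All 256 byte values survive a latin-1 decode/encode round trip.
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data

# ascii, by contrast, can't represent bytes above 0x7f.
try:
    data.decode("ascii")
except UnicodeDecodeError:
    print("ascii rejects bytes above 0x7f")
```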
> Another possibility is to just make it an UTF-8 encoding, but I think this
> would involve more overhead as Python would need to determine the maximum
> character size.
>
yeah -- that is (a) overhead, and (b) breaks the numpy fixed-size dtype
model. And it's trickier for numpy arrays, because they are mutable --
python strings can get away with it, since they never need to accommodate
a change in size after creation.
On Sat, Jul 12, 2014 at 5:02 PM, Nathaniel Smith <njs at pobox.com> wrote:
> I feel like for most purposes, what we *really* want is a variable length
> string dtype (I.e., where each element can be a different length.).
well, that is fundamentally different from the usual numpy data model -- it
would require that the array store pointers and dereference them on use.
Is there anywhere else in numpy (other than the object dtype) that does
that?
And if we did -- would it end up having any advantage over putting strings
in an object array? Or, for that matter, over using a list of strings?
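For comparison, here's what the object-array approach looks like today --
each element is a pointer to an ordinary Python str, so variable lengths
already work, just without any string-aware operations on the numpy side:

```python
import numpy as np

# An object array stores pointers to Python str objects, so each
# element can already be a different length.
a = np.array(["a", "bb", "ccc"], dtype=object)
print([len(s) for s in a])   # [1, 2, 3]
print(a.dtype)               # object
```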
> Pandas pays quite some price in overhead to fake this right now. Adding
> such a thing will cause some problems regarding compatibility (what to do
> with array(["foo"])) and education, but I think it's worth it in the long
> run.
i.e., do you use the fixed-length type or the variable-length type? I'm not
sure it's such a killer to have a default and let the user set a dtype if
they want something else.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov