[Numpy-discussion] String type again.

Tue Jul 15 14:40:58 EDT 2014

On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On 12 Jul 2014 23:06, "Charles R Harris" <charlesr.harris at gmail.com>
> wrote:
> >
> > As previous posts have pointed out, Numpy's `S` type is currently
> > treated as a byte string, which leads to more complicated code in Python 3.
> > OTOH, the unicode type is stored as UCS4, which consumes a lot of space,
> > especially for ASCII strings. This note proposes to adapt the currently
> > existing 'a' type letter, currently aliased to 'S', as a new fixed encoding
> > dtype. Python 3.3 introduced two one-byte internal representations for
> > unicode strings, ASCII and latin1. ASCII has the advantage that it is a
> > subset of UTF-8, whereas latin1 has a few more symbols. Another possibility
> > is to just make it a UTF-8 encoding, but I think this would involve more
> > overhead as Python would need to determine the maximum character size.
> > These are just preliminary thoughts, comments are welcome.
>
> I feel like for most purposes, what we *really* want is a variable length
> string dtype (i.e., where each element can be a different length). Pandas
> pays quite some price in overhead to fake this right now. Adding such a
> thing will cause some problems regarding compatibility (what to do with
> array(["foo"])) and education, but I think it's worth it in the long run. A
> variable length string with out of band storage also would allow for a lot
> of py3.3-style storage tricks if we want them.
>
> Given that, though, I'm a little dubious about adding a third fixed length
> string type, since it seems like it might be a temporary patch, yet raises
> the prospect of having to indefinitely support *5* distinct string types (3
> of which will map to py3 str)...
>
> OTOH, fixed length nul padded latin1 would be useful for various flat file
> reading tasks.
>
As one of the original agitators for this, let me reiterate that what the
astronomical community *really* wants is the original proposal as described
by Chris Barker [1], which is essentially what Charles said.  We have large
data archives that store ASCII string data in binary formats like FITS and
HDF5.  The current readers for those datasets present users with numpy S data
types, which in Python 3 cannot be compared to str (unicode) literals.  In
many cases those datasets are large; I regularly deal with multi-GB
bytestring arrays.  Converting those to a U dtype, which stores four bytes
per character, is not practical.
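
To make the cost concrete, here is a minimal sketch.  The array contents are
made up for illustration (they are not from a real archive), and the use of
np.char.decode is just one way to do the conversion, not necessarily what any
particular reader does:

```python
import numpy as np

# Made-up stand-in for a string column as a FITS/HDF5 reader would return it.
names = np.array([b"M31", b"NGC1275", b"M87"])  # fixed-width bytes, 'S7'

# Selecting rows works directly only with a bytes literal.
mask_bytes = names == b"M31"

# To compare against a str literal, the data must first be decoded to 'U'.
as_unicode = np.char.decode(names, "ascii")  # fixed-width unicode, 'U7'
mask_str = as_unicode == "M31"

# 'S' stores 1 byte per character, 'U' stores 4 (UCS4).
print(names.dtype, names.dtype.itemsize, "bytes per element")
print(as_unicode.dtype, as_unicode.dtype.itemsize, "bytes per element")
print("size ratio:", as_unicode.nbytes / names.nbytes)
```

The decoded copy is four times the size of the original, on top of the
original itself, which is why this does not scale to multi-GB arrays.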

This issue is the sole blocker that I personally have in beginning to move
our operations code base to be Python 3 compatible, and eventually actually
baselining Python 3.

A variable length string would be great, but it feels like a different (and
more difficult) problem to me.  If, however, this can be the solution to
the problem I described, and it can be implemented in a finite time, then
I'm all for it!  :-)

I hate begging for features with no chance of contributing much to the
implementation (lacking the necessary expertise in numpy internals).  I
would be happy to draft a NEP if that will help the process.

Cheers,
Tom

[1]:
http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html

> -n