[Numpy-discussion] proposal: smaller representation of string arrays

Mon Apr 24 20:56:43 EDT 2017

On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <aldcroft at head.cfa.harvard.edu> wrote:
> >
> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern at gmail.com> wrote:
> >>
> >> I am not unfamiliar with this problem. I still work with files that
> >> have fields that are supposed to be in EBCDIC but actually contain text in
> >> ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
> >> encodings. In that experience, I have found that just treating the data as
> >> latin-1 unconditionally is not a pragmatic solution. It's really easy to
> >> implement, and you do get a program that runs without raising an exception
> >> (at the I/O boundary at least), but you don't often get a program that
> >> really runs correctly or treats the data properly.
> >>
> >> Can you walk us through the problems that you are having with working
> >> with these columns as arrays of `bytes`?
> >
> > This is very simple and obvious but I will state for the record.
>
> I appreciate it. What is obvious to you is not obvious to me.
>
> > Reading an HDF5 file with character data currently gives arrays of
> > `bytes` [1].  In Py3 this cannot be compared to a string literal, and
> > comparing to (or assigning from) explicit byte strings everywhere in the
> > code quickly spins out of control.  This generally forces one to convert
> > the data to `U` type and incur the 4x memory bloat.
> >
> > In [22]: dat = np.array(['yes', 'no'], dtype='S3')
> >
> > In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
> > Out[23]: False
> >
> > In [24]: dat == b'yes'  # Right answer but not practical
> > Out[24]: array([ True, False], dtype=bool)
>
> I'm curious why you think this is not practical. It seems like a very
> practical solution to me.
>

In Py3 most character data will be `str`, not `bytes`.  So every time you
want to interact with the bytes array (compare, assign, etc.) you need to
explicitly coerce the right-hand operand to a bytes-compatible
object.  For code that developers write, this might be possible, but it
results in ugly code.  For the general science and engineering communities
that use numpy, it is completely untenable.
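
To make the coercion concrete, here is a minimal sketch of what every
comparison ends up looking like when the operand arrives as a `str`.  The
variable names and the choice of ASCII as the encoding are assumptions for
illustration only.

```python
import numpy as np

# Character data as it comes back from the file: bytes, 1 byte per character.
dat = np.array(['yes', 'no'], dtype='S3')

# In Py3 the value to compare against is normally a str ...
value = 'yes'

# ... so it has to be encoded by hand at every call site.
# 'ascii' is an assumed encoding for this sketch.
mask = dat == value.encode('ascii')

print(mask)          # boolean mask selecting the matching element
print(dat.itemsize)  # bytes used per element
```

One such line is fine.  The problem is that the `.encode(...)` has to be
repeated wherever the array meets ordinary Py3 strings.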

The only practical solution so far is to implement a unicode sandwich and
convert to the `U` type, at 4 bytes per character, at the interface.  That
conversion, and the memory cost that comes with it, is precisely what we are
trying to eliminate.
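
As a rough sketch of the sandwich, assuming the data is plain ASCII and
using `np.char.decode` / `np.char.encode` for the conversion at each edge
(the variable names are illustrative):

```python
import numpy as np

# Bytes as read from the file: 3 bytes per element.
raw = np.array([b'yes', b'no'], dtype='S3')

# Decode once at the input boundary; 'ascii' is an assumed encoding.
text = np.char.decode(raw, 'ascii')

# Inside the sandwich, plain str literals work as expected.
mask = text == 'yes'

# The price: each element now takes 4 bytes per character.
print(raw.itemsize, text.itemsize)

# Encode again at the output boundary before writing back.
out = np.char.encode(text, 'ascii')
```

The comparison is clean, but the in-memory array is four times the size of
the data on disk.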

- Tom

>
> --
> Robert Kern
>