[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Mon Apr 24 16:06:06 EDT 2017


On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas <aldcroft at head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>
>> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <aldcroft at head.cfa.harvard.edu> wrote:
>> >
>> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.barker at noaa.gov> wrote:
>>
>> >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
>> >
>> > +1.  The key point is that there is a HUGE amount of legacy science data in the form of FITS (an astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5, both of which use a character data type to store data that can be bytes 0-255.  Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.
>>
>> That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.
>
> If you could go back 30 years and get every scientist in the world to do the right thing, then sure.  But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters.

I am not unfamiliar with this problem. I still work with files that have
fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.
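
To make that concrete, here is a minimal sketch of the behavior in question (plain Python; the example text is just an illustration, not from any real dataset):

    # UTF-8 bytes for "naïve", as they might appear in a legacy file
    raw = "naïve".encode("utf-8")        # b'na\xc3\xafve'

    # Decoding as latin-1 never raises, but the text is wrong (mojibake) ...
    text = raw.decode("latin-1")         # 'naÃ¯ve'

    # ... even though re-encoding as latin-1 does give back the original bytes.
    assert text.encode("latin-1") == raw

You get the lossless round-trip, but any code that looks at the *text* is now looking at the wrong characters.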

Can you walk us through the problems that you are having with working with
these columns as arrays of `bytes`?

> So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution.

Well, I outlined a solution: work with `bytes` arrays with utilities to
convert to/from the Unicode-aware string dtypes (or `object`).
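
Something along these lines (a sketch, assuming the column really is UTF-8; the encoding argument is whatever the file actually uses):

    import numpy as np

    # bytes ('S') array as it might come out of an HDF5/FITS reader
    b = np.array([b"alpha", b"J\xc3\xb8rgensen"], dtype="S")

    # decode into a unicode ('U') array for text-level manipulation ...
    u = np.char.decode(b, "utf-8")

    # ... and encode back to bytes before writing out again
    assert (np.char.encode(u, "utf-8") == b).all()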

A UTF-8-specific dtype and maybe a string-specialized `object` dtype
address the very real and consequential problems that I face (namely and
respectively, working with HDF5 and in-memory manipulation of string
datasets).
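
(By the `object` route I mean what is already possible today, absent a specialized dtype: an object array whose elements are ordinary Python `str` objects, e.g. `np.array(["héllo", "wörld"], dtype=object)`, which handles arbitrary Unicode and variable lengths at the cost of the usual object-array overhead.)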

I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option. It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal).

--
Robert Kern