[Numpy-discussion] proposal: smaller representation of string arrays

Stephan Hoyer shoyer at gmail.com
Mon Apr 24 13:51:55 EDT 2017


On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker at noaa.gov>
wrote:

> latin-1 or latin-9 buys you (over ASCII):
>
> ...
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get a UnicodeDecodeError.
>

For a new application, it's a good thing if a text type breaks when you try
to stuff arbitrary bytes into it (see Python 2 vs Python 3 strings).

Certainly, I would argue that nobody should write data in latin-1 unless
they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be
used by default by loaders for legacy file formats/applications (e.g.,
netCDF3) that support unspecified "one byte strings." Then you're a few
short calls away from viewing (e.g., array.view('text[my_real_encoding]'),
if we support arbitrary encodings) or decoding (e.g.,
np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the
proper encoding. It's not realistic to expect users to know the true
encoding for strings from a file before they even look at the data.
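For instance, the decode step already works today on 'S' arrays via
np.char.decode (using latin-1 here purely as a stand-in for whatever the real
encoding turns out to be):

```python
import numpy as np

# Bytes as they might come out of a legacy file ('S' dtype, encoding unknown
# until the user inspects the data).
raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S8")

# Once the true encoding is identified, decode to a unicode ('U') array.
decoded = np.char.decode(raw, "latin-1")
print(decoded)  # ['café' 'naïve']

# Re-encoding round-trips back to the original bytes.
print(np.char.encode(decoded, "latin-1") == raw)  # [ True  True]
```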

On the other hand, if this is the use case, perhaps we really want an
encoding closer to a "Python 2" string, i.e., "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it would let you store arbitrary
bytes.
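A rough sketch of those proposed semantics, written as a plain Python class
rather than a dtype (the name and behavior here are illustrative only, not
any existing or planned NumPy API):

```python
class UnknownText:
    """Sketch of "text[unknown]": accept arbitrary bytes on storage,
    but only allow string operations when the bytes are valid ASCII."""

    def __init__(self, raw: bytes):
        self.raw = raw  # storing arbitrary bytes always succeeds

    def upper(self) -> str:
        # String operations decode as ASCII first, so they raise
        # UnicodeDecodeError on non-ASCII data instead of guessing.
        return self.raw.decode("ascii").upper()

print(UnknownText(b"hello").upper())  # HELLO
blob = UnknownText(b"\xff\xfe")       # fine: bytes are just stored
# blob.upper() would raise UnicodeDecodeError
```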


>>> Then use a native flexible-encoding dtype for everything else.
>>
>> No opposition here from me. Though again, I think utf-8 alone would also
>> be enough.
>
> maybe so -- the major reason for supporting others is binary data exchange
> with other libraries -- but maybe most of them have gone to utf-8 anyway.
>

Indeed, it would be helpful for this discussion to know what other
encodings are actually currently used by scientific applications.

So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
"unknown".

> The current 'S' dtype truncates silently already:
>
One advantage of a new (non-default) dtype is that we can change this
behavior.
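For reference, the existing behavior, which a new dtype would be free to
turn into an error:

```python
import numpy as np

# Assigning a value that is too long to an 'S' element truncates silently.
a = np.zeros(1, dtype="S3")
a[0] = b"truncated"
print(a[0])  # b'tru' -- no error or warning
```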


> Also -- if utf-8 is the default -- what do you get when you create an
> array from a python string sequence? Currently with the 'S' and 'U' dtypes,
> the dtype is set to the longest string passed in. Are we going to pad it a
> bit? stick with the exact number of bytes?
>

It might be better to avoid this for now, and force users to be explicit
about encoding if they use the dtype for encoded text. We can keep
bytes/str mapped to the current choices.
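For comparison, the current inference behavior being referred to, where the
itemsize comes from the longest input string:

```python
import numpy as np

# With today's 'U' dtype, the size is set to the longest string passed in,
# with no extra padding.
a = np.array(["a", "hello"])
print(a.dtype)  # <U5: five code points, sized to the longest input
```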