[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 14:46:36 EDT 2017

Chuck: That sounds like something we want to deprecate, for the same reason
that Python 3 no longer lets str(b'123') implicitly decode bytes to text.

Specifically, it seems like astype should never be allowed to convert between
unicode and byte arrays, so your example would need to be written as:

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]:
array(['1', '2', '3'],
      dtype='|S1')

In [3]: a.view('U[ascii]')
Out[3]:
array([u'1', u'2', u'3'],
      dtype='<U[ascii]1')

In [4]: a.view('U[ascii]').astype('U[ucs32]')  # re-encoding is an astype operation
Out[4]:
array([u'1', u'2', u'3'],
      dtype='<U1')     # UCS32 is the current default

In [5]: a.view('U[ascii]').astype('U[ucs32]').view(uint8)
Out[5]:
array([0x31, 0, 0, 0, 0x32, 0, 0, 0, 0x33, 0, 0, 0])

I guess for backwards compatibility, .view('U') would always mean
view('U[ucs32]').
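
For comparison, here is a minimal sketch of the same steps on NumPy as it
exists today. The 'U[ascii]' spelling above is only proposed syntax, so this
assumes np.char.decode as a stand-in for the view-plus-re-encode step. That
substitution is an assumption of this sketch, not part of the proposal, and
unlike a view it copies the data.

```python
import numpy as np

# Same starting point as In [1]: the ASCII codes for '1', '2', '3'.
a = np.array([1, 2, 3], np.uint8) + 0x30

# Reinterpreting the same bytes is a view: nothing is copied or converted.
s = a.view('S1')

# Decoding is explicit here; the result is a UCS-4 array of width 1.
u = np.char.decode(s, 'ascii')

# Force little-endian storage so the byte layout matches Out[5] on any
# machine, then look at the raw bytes: four per character.
raw = u.astype('<U1').view(np.uint8)
print(raw)
```

The point of the split is the same as in the proposal: view only
reinterprets memory, while anything that changes the encoding is an explicit
conversion.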

As an aside - it’d be nice if parameterized dtypes acquired a non-string
syntax, like np.unicode_['ucs32'].

Eric

On Tue, 25 Apr 2017 at 19:19 Charles R Harris <charlesr.harris at gmail.com>
wrote:

> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted at gmail.com>
> wrote:
>
>>
>> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern at gmail.com>
>> wrote:
>>
>>> * HDF5 supports fixed-length and variable-length string arrays encoded
>>> in ASCII and UTF-8. In all cases, these strings are NULL-terminated
>>> (despite the documentation claiming that there are more options). In
>>> practice, the ASCII strings permit high-bit characters, but the encoding is
>>> unspecified. Memory-mapping is rare (but apparently possible). The two
>>> major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to
>>> support that HDF5 option. Compression is supported for fixed-length string
>>> arrays but not variable-length string arrays.
>>>
>>> * FITS supports fixed-length string arrays that are NULL-padded. The
>>> strings do not have a formal encoding, but in practice, they are typically
>>> mostly ASCII characters with the occasional high-bit character from an
>>> unspecified encoding. Memory-mapping is a common practice. These arrays can
>>> be quite large even if each scalar is reasonably small.
>>>
>>> * pandas uses object arrays for flexible in-memory handling of string
>>> columns. Lengths are not fixed, and None is used as a marker for missing
>>> data. String columns must be written to and read from a variety of formats,
>>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
>>> with `unicode/str` objects instead of `bytes`.
>>>
>>> * There are a number of sometimes-poorly-documented,
>>> often-poorly-adhered-to, aging file format "standards" that include string
>>> arrays but do not specify encodings, or such specification is ignored in
>>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
>>> difficult to perform.
>>>
>>> * In Python 3 environments, `unicode/str` objects are rather more
>>> common, and simple operations like equality comparisons no longer work
>>> between `bytes` and `unicode/str`, making it difficult to work with numpy
>>> string arrays that yield `bytes` scalars.
>>>
>>
>> It seems the greatest challenge is interacting with binary data from
>> other programs and libraries. If we were living entirely in our own data
>> world, Unicode strings in object arrays would generally be pretty
>> satisfactory. So let's try to get what is needed to read and write other
>> people's formats.
>>
>> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't
>> map directly to numpy arrays; we can store it however we want, as
>> conversion is necessary anyway.
>>
>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
>> other packages are waiting specifically for it. But specifying this
>> requires two pieces of information: What is the encoding? and How is the
>> length specified? I know they're not numpy-compatible, but FITS header
>> values are space-padded; does that occur elsewhere? Are there other ways
>> existing data specifies string length within a fixed-size field? There are
>> some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
>> PKCS7, etc. - but they are probably too specialized to be needed? We should make
>> sure we can support all the ways that actually occur.
>>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.
>
> For byte strings, it looks like we need a parameterized type. This is for
> two uses, display and conversion to (Python) unicode. One could handle the
> display and conversion using view and astype methods. For instance, we
> already have
>
> In [1]: a = array([1,2,3], uint8) + 0x30
>
> In [2]: a.view('S1')
> Out[2]:
> array(['1', '2', '3'],
>       dtype='|S1')
>
> In [3]: a.view('S1').astype('U')
> Out[3]:
> array([u'1', u'2', u'3'],
>       dtype='<U1')
>
> Chuck
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
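
For reference, the fixed-storage-size zero-padded UTF-8 layout discussed in
the quoted thread can be approximated today with the existing 'S' dtype. This
is a rough sketch under that assumption, not the proposed dtype: the sample
words and variable names are made up for illustration, and the caller has to
track the encoding and the byte width by hand.

```python
import numpy as np

# '\u00e9' is e with an acute accent: 1 character, 2 bytes in UTF-8.
words = ['caf\u00e9', 'abc']

# Encode first; the field width is the longest encoded length in bytes,
# not the longest length in characters.
encoded = [w.encode('utf-8') for w in words]
width = max(len(b) for b in encoded)

# The 'S' dtype stores fixed-size byte strings padded with NUL bytes.
fixed = np.array(encoded, dtype=f'S{width}')

# Scalars come back as bytes with the trailing padding stripped. In
# Python 3 they never compare equal to str until they are decoded.
second = fixed[1].item()

# Decoding the whole array yields a UCS-4 unicode array.
decoded = np.char.decode(fixed, 'utf-8')
print(fixed.dtype, fixed.tobytes(), decoded)
```

Because trailing NULs are treated as padding, this layout cannot represent a
string that really ends in a NUL byte, which is one reason the question of
how length is specified matters.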
