[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 15:38:19 EDT 2017

On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:

>
>
> On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern <robert.kern at gmail.com>
> wrote:
>
>> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
>> charlesr.harris at gmail.com> wrote:
>> >
>> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted at gmail.com> wrote:
>>
>> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
>> >> other packages are waiting specifically for it. But specifying this
>> >> requires two pieces of information: What is the encoding? and How is the
>> >> length specified? I know they're not numpy-compatible, but FITS header
>> >> values are space-padded; does that occur elsewhere? Are there other ways
>> >> existing data specifies string length within a fixed-size field? There are
>> >> some cryptographic length-specification tricks - ANSI X9.23, ISO 10126,
>> >> PKCS7, etc. - but they are probably too specialized to need? We should make
>> >> sure we can support all the ways that actually occur.
>> >
>> >
>> > Agree with the UTF-8 fixed byte length strings, although I would tend
>> > towards null terminated.
>>
>> Just to clarify some terminology (because it wasn't originally clear to
>> me until I looked it up in reference to HDF5):
>>
>> * "NULL-padded" implies that, for a fixed width of N, there can be up to
>> N non-NULL bytes. Any extra space left over is padded with NULLs, but no
>> space needs to be reserved for NULLs.
>>
>> * "NULL-terminated" implies that, for a fixed width of N, there can be up
>> to N-1 non-NULL bytes. There must always be space reserved for the
>> terminating NULL.
>>
>> I'm not really sure if "NULL-padded" also specifies the behavior for
>> embedded NULLs. It's certainly possible to deal with them: just strip
>> trailing NULLs and leave any embedded ones alone. But I'm also sure that
>> there are some implementations somewhere that interpret the requirement as
>> "stop at the first NULL or the end of the fixed width, whichever comes
>> first", effectively being NULL-terminated just not requiring the reserved
>> space.
>>
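
The terminology above can be sketched in a few lines of Python. This is only
an illustration: the helper names are made up for this example and are not
NumPy or HDF5 API. It shows the two encodings (N usable bytes versus N-1) and
the two ways a reader might treat embedded NULLs.

```python
def encode_null_padded(value: bytes, width: int) -> bytes:
    """NULL-padded: up to `width` non-NULL bytes, remainder filled with NULLs."""
    if len(value) > width:
        raise ValueError("value does not fit in the field")
    return value.ljust(width, b"\x00")


def encode_null_terminated(value: bytes, width: int) -> bytes:
    """NULL-terminated: up to `width - 1` bytes, one byte reserved for the NULL."""
    if len(value) > width - 1:
        raise ValueError("value does not leave room for the terminator")
    return value.ljust(width, b"\x00")


def decode_strip_trailing(field: bytes) -> bytes:
    """Strip trailing NULLs and leave any embedded ones alone."""
    return field.rstrip(b"\x00")


def decode_stop_at_first_null(field: bytes) -> bytes:
    """Stop at the first NULL or the end of the field, whichever comes first."""
    return field.split(b"\x00", 1)[0]
```

With a width of 4, `encode_null_padded(b"abcd", 4)` succeeds while
`encode_null_terminated(b"abcd", 4)` raises, and the two decoders disagree
only when a field contains an embedded NULL.
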
>
> Thanks for the clarification. NULL-padded is what I meant.
>
> I'm wondering how much of the desired functionality we could get by simply
> subclassing ndarray in python. I think we mostly want to be able to view
> byte strings and convert to unicode if needed.
>
>
And I think the really tricky part is sorting and rich comparison.
Unfortunately, the comparison function is currently located in the C
structure. I suppose we could define a C wrapper function to go in the slot.
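
As a rough illustration of the view-and-convert idea, the sketch below uses
only existing NumPy calls (np.frombuffer, np.char.decode, np.sort). The field
width of 8 bytes and the sample words are assumptions made up for this
example. It views a NULL-padded UTF-8 buffer as an 'S8' array, decodes it to
unicode, and sorts on the raw bytes.

```python
import numpy as np

WIDTH = 8  # hypothetical fixed field width, in bytes
words = ["zebra", "\u00e9clair", "apple"]  # made-up sample data

# Build a buffer of NULL-padded UTF-8 fields, WIDTH bytes each.
raw = b"".join(w.encode("utf-8").ljust(WIDTH, b"\x00") for w in words)

# View the buffer as fixed-width byte strings; no copy is made.
arr = np.frombuffer(raw, dtype=f"S{WIDTH}")

# Convert to unicode only when needed.
decoded = np.char.decode(arr, "utf-8")

# Sort on the raw bytes, then decode the sorted result.
sorted_decoded = np.char.decode(np.sort(arr), "utf-8")
```

Sorting the bytes and then decoding gives the same order as sorting the
decoded strings by code point, because byte-wise order of valid UTF-8 matches
code point order. This only exercises the existing 'S' dtype; it does not
touch the comparison slot itself.
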

Chuck