[Numpy-discussion] proposal: smaller representation of string arrays
Eric Wieser
wieser.eric+numpy at gmail.com
Wed Apr 26 11:39:46 EDT 2017
> I think we can implement viewers for strings as ndarray subclasses. Then one
> could do `my_string_array.view(latin_1)`, and so on. Essentially that just
> changes the default encoding of the 'S' array. That could also work for uint8
> arrays if needed.
>
> Chuck
To handle structured data-types containing encoded strings, we'd also
need to subclass `np.void`.
Things would get messy when a structured dtype contains two strings in
different encodings (or more likely, one bytestring and one
textstring) - we'd need some way to specify which fields are in which
encoding, and using subclasses means that this can't be contained
within the dtype information.
So I think there's a strong argument for solving this with `dtype`s
rather than subclasses. This really doesn't seem hard though.
Something like (C-but-as-python):
    def ENCSTRING_getitem(ptr, arr):  # the PyArray_ArrFuncs getitem slot
        encoded = STRING_getitem(ptr, arr)
        return encoded.decode(arr.dtype.encoding)

    def ENCSTRING_setitem(val, ptr, arr):  # the PyArray_ArrFuncs setitem slot
        val = val.encode(arr.dtype.encoding)
        # todo: handle "safe" truncation, where safe might mean keep
        # codepoints, keep graphemes, or never allow
        STRING_setitem(val, ptr, arr)
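
To make the idea concrete, here is a minimal pure-Python sketch of the same pair of functions, using a `bytearray` as a stand-in for the array's memory. It is not NumPy code: the function names, the explicit `offset`/`itemsize`/`encoding` arguments, and the choice of "keep whole codepoints" as the truncation rule are all assumptions made for illustration. It also assumes an encoding such as UTF-8 or Latin-1 whose encoded text contains no NUL bytes, so NUL padding can be stripped safely.

```python
# Pure-Python stand-in for a fixed-width, NUL-padded 'S' item.
# All names here are hypothetical; nothing below is a NumPy API.

def encstring_setitem(val, buf, offset, itemsize, encoding):
    """Encode `val` and store it in buf[offset:offset + itemsize]."""
    encoded = val.encode(encoding)
    if len(encoded) > itemsize:
        # One possible "safe" truncation: cut at the byte limit, then drop
        # any partial trailing codepoint by decoding with errors="ignore"
        # and encoding again.
        clipped = encoded[:itemsize].decode(encoding, errors="ignore")
        encoded = clipped.encode(encoding)
    # Pad with NULs so the slice assignment keeps the buffer length fixed.
    buf[offset:offset + itemsize] = encoded.ljust(itemsize, b"\x00")


def encstring_getitem(buf, offset, itemsize, encoding):
    """Read one item, strip the NUL padding, and decode it."""
    raw = bytes(buf[offset:offset + itemsize]).rstrip(b"\x00")
    return raw.decode(encoding)


buf = bytearray(8)  # two items of 4 bytes each
encstring_setitem("añb", buf, 0, 4, "utf-8")   # 4 bytes, fits exactly
encstring_setitem("abcñ", buf, 4, 4, "utf-8")  # 5 bytes, ñ would be split
first = encstring_getitem(buf, 0, 4, "utf-8")
second = encstring_getitem(buf, 4, 4, "utf-8")
```

Here the second item loses its last character rather than storing half of it. "Never allow" would instead raise when `len(encoded) > itemsize`, and keeping graphemes intact would need more than the standard codecs provide.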
We'd probably need to be careful to do a decode/encode dance when
copying from one encoding to another, but we [already have
bugs](https://github.com/numpy/numpy/issues/3258) in those cases
anyway.
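
A rough sketch of that decode/encode dance, again in pure Python over raw bytes rather than real arrays. The helper name `transcode_items` and its arguments are hypothetical, and raising when an item no longer fits is just one possible policy:

```python
def transcode_items(src, itemsize_src, enc_src, itemsize_dst, enc_dst):
    """Copy fixed-width, NUL-padded items from one encoding to another."""
    out = bytearray()
    for start in range(0, len(src), itemsize_src):
        raw = bytes(src[start:start + itemsize_src]).rstrip(b"\x00")
        text = raw.decode(enc_src)      # decode with the source encoding
        encoded = text.encode(enc_dst)  # encode with the destination encoding
        if len(encoded) > itemsize_dst:
            raise ValueError(
                "item %r does not fit in %d bytes" % (text, itemsize_dst)
            )
        out += encoded.ljust(itemsize_dst, b"\x00")
    return bytes(out)


# Two 2-byte Latin-1 items: "é" (1 byte plus padding) and "ab".
latin1_items = "é".encode("latin-1").ljust(2, b"\x00") + b"ab"
# "é" takes 2 bytes in UTF-8, so the destination items are 3 bytes wide here.
utf8_items = transcode_items(latin1_items, 2, "latin-1", 3, "utf-8")
```

Copying the bytes unchanged would not work in this case: the Latin-1 byte `0xE9` on its own is not valid UTF-8, so the item has to go through text on the way across.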
Is it reasonable that the user of such an array would want to work
with plain `builtin.unicode` objects, rather than some special numpy
scalar type?
Eric