[Numpy-discussion] proposal: smaller representation of string arrays

josef.pktd at gmail.com
Wed Apr 26 15:03:33 EDT 2017


On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal"
> <chris.barker at noaa.gov> wrote:
>
>
> > UTF-8 does not match the character-oriented Python text model. Plenty
> > of people argue that that isn't the "correct" model for Unicode text
> > -- maybe so, but it is the model python 3 has chosen. I wrote a much
> > longer rant about that earlier.
> >
> > So I think the easy to access, and particularly defaults, numpy string
> > dtypes should match it.
>
>
> This seems a little vague? The "character-oriented Python text model" is
> just that str supports O(1) indexing of characters. But... Numpy doesn't. If
> you want to access individual characters inside a string inside an array,
> you have to pull out the scalar first, at which point the data is copied and
> boxed into a Python object anyway, using whatever representation the
> interpreter prefers. So AFAICT it makes literally no difference to the user
> whether numpy's internal representation allows for fast character access.

You can create a view on the individual characters (or bytes) of a
string array, AFAICS:

>>> import numpy as np
>>> t = np.array(['abcdefg']*10)
>>> t2 = t.view([('s%d' % i, '<U1') for i in range(7)])
>>> t2['s5']
array(['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'],
      dtype='<U1')


>>> t.view('<U1').reshape(len(t), -1)[:, 2]
array(['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c'],
      dtype='<U1')
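
To check that these really are views and not copies, here is a minimal
sketch. It assumes the fixed-width unicode dtype stores 4 bytes per code
point, and the names `chars`, `shares` and `first` are only for
illustration.

```python
import numpy as np

# Fixed-width unicode array: 10 elements, 7 code points each,
# 4 bytes per code point.
t = np.array(['abcdefg'] * 10)
assert t.dtype.itemsize == 7 * 4

# Reinterpret the same buffer as single-character cells, one row per string.
# 'U1' (native byte order) is used instead of '<U1' so the sketch does not
# depend on the host being little-endian.
chars = t.view('U1').reshape(len(t), -1)
shares = np.shares_memory(t, chars)   # True if no copy was made

# Writing through the view changes the original strings in place.
chars[:, 2] = 'X'
first = t[0]
```

Strings shorter than the fixed width are null-padded, so the padding
cells read back as empty strings in the character view.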


Josef

>
> -n
>

