[Numpy-discussion] String accessor methods

Sebastian Berg sebastian at sipsolutions.net
Sun Mar 7 11:16:36 EST 2021


On Sun, 2021-03-07 at 09:34 +0000, Kevin Sheppard wrote:
> I think that any string functions that are exposed from an ndarray
> would have to be guaranteed to work in-place. Requiring casting to
> objects to use the methods feels more like syntactic sugar than an
> essential case. I think most of the ones mentioned are low
> performance and can't take advantage of the storage as a blob of
> int8 (ascii) or int32 (utf32) that underlies NumPy string arrays.
> 
> I also think the existence of these in pandas reduces the case for
> them being in NumPy.

I agree with this; the need seems much lower in NumPy. And NumPy's
currently somewhat weird strings make it, at least for me, even less
appealing to expose more string utilities of any kind at this time.
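
To make the storage point concrete, here is a minimal sketch of what
"a blob of int32 (utf32)" means for a fixed-width unicode array. The
in-place ASCII upper-casing is only an illustration of working on the
raw code points; it is not how `np.char.upper` is implemented, and it
assumes a contiguous array in native byte order.

```python
import numpy as np

arr = np.array(["test first", "test second"])

# Fixed-width unicode: every element reserves room for the longest string,
# at 4 bytes per code point (11 code points here).
print(arr.dtype, arr.itemsize)

# Reinterpret the same memory as one uint32 code point per character.
# Shorter strings are padded with zeros.
codes = arr.view(np.uint32).reshape(arr.shape[0], -1)

# ASCII-only upper-casing done in place on the code points,
# without creating any Python str objects.
is_lower = (codes >= ord("a")) & (codes <= ord("z"))
codes[is_lower] -= 32

print(arr)
```

Because `codes` is a view, the change is visible through `arr` itself.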

In general, there is probably something to be said for such an
"accessor", in the sense of having a place to put methods which are
specific to the array's dtype.  Other examples are datetime/timedelta
or Units and probably many potential DTypes [1]. It is one advantage
that the `astropy.units.Quantity` subclass has over a DType based
solution: methods can be added very transparently.

Basically: The current `np.char` functions are a bit weird and I would
need quite a bit more convincing to expose them at this time.
But, I would be delighted if we can think of a solution that goes
beyond `str` [2].  That probably means not adding `ndarray.str` at all,
or adding it only if the array has a string DType.
But do it in a way that generalizes!  That could be a DType specific
mixin class, or I had previously played with the thought of a "generic"
accessor:
    `ndarray.elementwise.<ufuncs-provided-by-DType>`

But those go beyond the original string request and need some smart
idea/thoughts!
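
Just to make the "generic accessor" thought a bit more tangible, a
rough sketch follows. Everything in it is hypothetical: `elementwise`
and `_REGISTRY` are made-up names, the lookup by `dtype.kind` is a
stand-in for a DType providing its own functions, and a free function
is used because nothing can be attached to `ndarray` from the outside.

```python
from functools import partial
from types import SimpleNamespace

import numpy as np

# Hypothetical registry: dtype kind -> {name: elementwise function}.
# In a real design each DType would presumably supply these itself.
_REGISTRY = {
    "U": {"title": np.char.title, "upper": np.char.upper},
    "M": {"is_nat": np.isnat},
}


def elementwise(arr):
    """Return a namespace of the functions registered for arr's dtype."""
    funcs = _REGISTRY.get(arr.dtype.kind)
    if funcs is None:
        raise AttributeError(f"no elementwise functions for dtype {arr.dtype}")
    # Bind the array as first argument so the functions read like methods.
    return SimpleNamespace(**{name: partial(f, arr) for name, f in funcs.items()})


names = np.array(["test first", "test second"])
print(elementwise(names).title())

dates = np.array(["2021-03-07", "NaT"], dtype="datetime64[D]")
print(elementwise(dates).is_nat())
```

The point of the sketch is only that the same entry point serves strings
and datetimes, while a dtype with nothing registered raises instead of
exposing methods that cannot work.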

An interesting aside is that `arr.imag` and `arr.real` fall into the
same category. But they are narrow enough that we can just have a
specific solution for them.
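
For reference, a small example of how these two behave today on a
complex array: both are views into the array's memory, so writing
through them modifies the original.

```python
import numpy as np

z = np.array([1 + 2j, 3 + 4j])

# .real and .imag on a complex array are views, not copies.
re = z.real
re[0] = 10.0

print(z[0])
print(np.shares_memory(z, z.imag))
```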

Cheers,

Sebastian



[1] Datetimes/timedelta might have some use of basic timezone handling
(not sure if relevant to NumPy's naive datetimes).

`astropy.units.Quantity` has a few extra methods/properties:

* `.cgs`, `.si`, `.decompose()`, `.to()`: cast to different unit.
* `.unit`
* `.value`: get a value array view without any unit.
* `.to_value()` method that returns a copy, not a view.

Of course we can spell those using DTypes, but I think the spelling
might get long: `arr.astype(arr.dtype.cgs)`, or
`arr.view(arr.dtype.unitless)`.
Utility functions similar to `np.char` also can simplify all of this,
but methods do have merit.
Other user DTypes could very well have more compelling use-cases.
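
As a rough analogy using only what NumPy already has (timedelta64
rather than a units DType, so `.cgs` and `.unitless` above remain
hypothetical), the "spell it with the dtype" style looks like this:

```python
import numpy as np

td = np.array([1, 2], dtype="timedelta64[h]")

# Unit conversion spelled as a cast to a differently parametrized dtype.
secs = td.astype("timedelta64[s]")

# "Unit-less" values spelled as a view onto the underlying int64 storage.
raw = secs.view(np.int64)

print(secs)
print(raw)
```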


[2] But it probably won't reach my serious thinking cycles for a while.
For starters, dedicated utility functions seem decent enough...


> 
> On Sun, Mar 7, 2021, 05:32 Todd <toddrjen at gmail.com> wrote:
> 
> > On Sat, Mar 6, 2021 at 12:57 PM dan_patterson <dan_patterson at outlook.com> wrote:
> > 
> > > They are in np.char
> > > 
> > > mystr = np.array(["test first", "test second", "test third"])
> > > 
> > > np.char.title(mystr)
> > > array(['Test First', 'Test Second', 'Test Third'], dtype='<U11')
> > > 
> > 
> > I mentioned those in my email, but they are far less convenient to
> > use than class methods, nor do they relate well to how built-in
> > strings are used in Python. That is why other projects have started
> > using accessor methods and why Python removed all the separate
> > string functions in Python 3. The functions in np.char are also
> > limited in their capabilities, and fairly poorly documented in my
> > opinion.  Some of those limitations are impossible to overcome, for
> > example they inherently can never support operators, addition or
> > multiplication, or slicing like Python strings can, while an
> > accessor could.
> > 
> > However, putting them as top-level methods for ndarray would
> > pollute the methods too much. That is why I am suggesting numpy do
> > the same thing that pandas, xarray, etc. are doing and put those
> > as methods under a 'str' attribute for ndarrays rather than as
> > separate functions.
> > 
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> > 
