[Numpy-discussion] [SciPy-dev] Deprecate chararray [was Plea for help]
Michael Droettboom
mdroe at stsci.edu
Tue Sep 22 13:58:23 EDT 2009
Sorry to resurrect a long-dead thread, but I've been continuing Chris
Hanley's investigation of chararray at Space Telescope Science Institute
(and the broader astronomical community) for a while and have some
findings to report back.
What I've taken from this thread is that chararray is in need of a
maintainer. I am able to devote some time to the cause, but first would
like to clarify what it will take to make its continued inclusion more
comfortable.
Let me start with the use case. chararrays are extensively returned
from pyfits (a tool to handle the standard astronomy data format).
pyfits is the basis of many applications, and it would be impossible to
audit all of that code. Most authors of those tools do not track
numpy-discussion closely, which is why we don't hear from them on this
list, but there is a great deal of pyfits-using code.
Doing some spot-checking on this code, a common thing I see is SQL-like
queries on recarrays of objects. For instance, it is very common to
have a table of objects, with a "Target" column which is a string, and
do something like (where c is a chararray of the 'Target' column):
subset = array[np.where(c.startswith('NGC'))]
Strictly speaking, this is a use case for "vectorized string
operations", not necessarily for the chararray class as it presently
stands. One could almost as easily do:
subset = array[np.where([x.startswith('NGC') for x in c])]
...and the latter is even slightly faster, since chararray currently
loops in Python anyway.
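As a rough illustration of the query pattern described above, here is a minimal, self-contained sketch. The table contents, the "Mag" column, and the dtype are invented for the example (they stand in for whatever pyfits would return), and a plain structured array is used instead of a chararray:

```python
import numpy as np

# Hypothetical table standing in for a record array returned by pyfits.
table = np.array(
    [("NGC 1300", 10.4), ("M31", 3.4), ("NGC 224", 3.4), ("IC 342", 9.1)],
    dtype=[("Target", "U16"), ("Mag", "f8")],
)

# The string column we want to query.
c = table["Target"]

# Build a boolean mask with an ordinary Python loop over the column.
mask = np.array([x.startswith("NGC") for x in c], dtype=bool)

# Select the matching rows, as in the one-liner above.
subset = table[np.where(mask)]
print(subset["Target"])
```

The mask is built element by element in Python, which is the cost the rest of this message is concerned with.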
Even better, though, I have some experimental code to perform the loop
in C, and I get a 5x speedup on a table with ~120,000 rows. If that were
to be included in numpy, that's a strong argument against recommending
list comprehensions in user code. The use case suggests the continued
existence of vectorized string operations in numpy -- whether that
continues to be chararray, or some newer/better interface + chararray
for backward compatibility, is an open question. Personally I think a
less object-oriented approach and just having a namespace full of
vectorized string functions might be cleaner than the current situation
of needing to create a view class around an ndarray. I'm suggesting
something like the following, using the same example, where {STR} is
some namespace we would fill with vectorized string operations:
subset = array[np.where(np.{STR}.startswith(c, 'NGC'))]
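To make the shape of that interface concrete, here is a small sketch of what one such free function might look like. The name `startswith` and its signature are assumptions for illustration only, and the loop here is still in Python; it merely stands in for the C loop mentioned above:

```python
import numpy as np

def startswith(arr, prefix):
    """Hypothetical free function: elementwise str.startswith on a string array.

    Returns a boolean array with the same shape as the input.
    """
    arr = np.asarray(arr)
    # Loop over the flattened elements; a real implementation would do this in C.
    flat = np.fromiter(
        (s.startswith(prefix) for s in arr.ravel()),
        dtype=bool,
        count=arr.size,
    )
    return flat.reshape(arr.shape)

c = np.array(["NGC 1300", "M31", "NGC 224"])
mask = startswith(c, "NGC")
subset = c[np.where(mask)]
print(subset)
```

Because the function takes and returns plain ndarrays, no view class is needed on the caller's side.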
Now on to chararray as it now stands. I view chararray as really two
separable pieces of functionality:
1) Convenience to perform vectorized string operations using
'.method' syntax, or in some cases infix operators (+, *)
2) Implicit "rstrip"ping of values
(Note that raw ndarrays truncate values at the first NULL character,
like C strings, but chararrays will strip any and all whitespace
characters from the end).
Changing (2) seems likely to become a source of subtle bugs.
Unfortunately, there's an inconsistency between 1) and 2) in the present
implementation. For example:
In [9]: a = np.char.array(['a '])
In [10]: a
Out[10]: chararray(['a'], dtype='|S3')
In [11]: a[0] == 'a'
Out[11]: True
In [12]: a.endswith('a')
Out[12]: array([False], dtype=bool)
This is *the* design wart of chararray, IMHO, and one that's difficult
to fix without breaking compatibility. It might be a worthwhile
experiment to remove (2) and see how much we really break, but it would
be impossible to know for sure.
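The mismatch can be modelled without chararray at all. The sketch below uses a plain ndarray and explicit `rstrip` calls to imitate the two behaviours; it is an illustration of the inconsistency, not a reproduction of chararray's internals, and the variable names are invented for the example:

```python
import numpy as np

# A plain ndarray keeps the trailing space in storage.
raw = np.array(["a "])

# Functionality (2): element access hands back the rstrip'ped value.
item = raw[0].rstrip()

# Functionality (1) as it behaves in the session above: the method sees
# the stored, unstripped data, so the trailing space defeats endswith.
on_stored = np.array([s.endswith("a") for s in raw])

# What a consistent design would return: strip first, then apply the method.
on_stripped = np.array([s.rstrip().endswith("a") for s in raw])

print(item == "a", on_stored, on_stripped)
```

The first and third results agree with each other; the second is the odd one out, which is the wart in question.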
Now to address the concerns enumerated in this thread. Unfortunately, I
don't know where this thread began before it landed on the Numpy list,
so I may be missing details which would help me address them.
> 0) "it gets very little use" (an assumption you presumably dispute);
>
Certainly not true from where I stand.
> 1) "is pretty much undocumented" (less true than a week ago, but still true for several of the attributes, with another handful or so falling into the category of "poorly documented");
>
I don't quite understand this one -- 99% of the methods are wrappers
around standard Python string methods. I don't think we should
redocument those. I agree it needs a better top level docstring about
its purpose (see functionalities (1) and (2) above) and its status (for
backward compatibility).
> 2) "probably more buggy than most other parts of NumPy" ("probably" being a euphemism, IMO);
>
Trac has these bugs. Any others?
http://projects.scipy.org/numpy/ticket/1199
http://projects.scipy.org/numpy/ticket/1200
http://projects.scipy.org/numpy/ticket/856
http://projects.scipy.org/numpy/ticket/855
http://projects.scipy.org/numpy/ticket/1231
> 3) "there is not a really good use-case for it" (a conjecture, but one that has yet to be challenged by counter-example);
>
See above.
> 4) it's not the first time its presence in NumPy has been questioned ("as Stefan pointed out when asking this same question last year")
>
Hopefully we're addressing that now.
> 5) NumPy already has a (perhaps superior) alternative ("object arrays would do nicely if one needs this functionality");
>
No -- that gives the problem of even slower Python-looping to do
vectorized string operations.
> to which I'll add:
>
> 6) it is, on its face, "counter to the spirit" of NumPy.
>
I don't quite know what this means -- but I do find the fact that it's a
view class with methods a little bit clumsy. Is that what you meant?
So here's my TODO list related to all this:
1) Fix bugs in Trac
2) Improve documentation (though probably not in a method-by-method way)
3) Improve unit test coverage
4a) Create C-based vectorized string operations
4b) Refactor chararray in terms of those
4c) Design and create an interface to those methods that will be the
"right way" going forward
Anything else?
Mike
--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA