[Numpy-discussion] [SciPy-dev] Deprecate chararray [was Plea for help]

Tue Sep 22 13:58:23 EDT 2009

Sorry to resurrect a long-dead thread, but I've been continuing Chris 
Hanley's investigation of chararray at Space Telescope Science Institute 
(and the broader astronomical community) for a while and have some 
findings to report back.

What I've taken from this thread is that chararray is in need of a 
maintainer.  I am able to devote some time to the cause, but first would 
like to clarify what it will take to make its continued inclusion more 
comfortable.

Let me start with the use case.  chararrays are extensively returned 
from pyfits (a tool to handle the standard astronomy data format).  
pyfits is the basis of many applications, and it would be impossible to 
audit all of that code.  Most authors of those tools do not track 
numpy-discussion closely, which is why we don't hear from them on this 
list, but there is a great deal of pyfits-using code. 

Doing some spot-checking on this code, a common thing I see is SQL-like 
queries on recarrays of objects.  For instance, it is very common to 
have a table of objects, with a "Target" column which is a string, and 
do something like (where c is a chararray of the 'Target' column):

   subset = array[np.where(c.startswith('NGC'))]

Strictly speaking, this is a use case for "vectorized string 
operations", not necessarily for the chararray class as it presently 
stands.  One could almost as easily do:

   subset = array[np.where([x.startswith('NGC') for x in c])]

...and the latter is even slightly faster, since chararray currently 
loops in Python anyway.
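
To make the list-comprehension form concrete, here is a minimal sketch 
on a plain structured ndarray.  The table, its column names and its 
values are made up for illustration and do not come from pyfits.  It 
also shows that the boolean mask can be used for indexing directly, 
without np.where:

```python
import numpy as np

# Hypothetical table: names and values are invented for this example.
table = np.array(
    [("NGC 1300", 10.4), ("M31", 3.4), ("NGC 7331", 9.5)],
    dtype=[("Target", "U16"), ("Mag", "f8")],
)
c = table["Target"]  # plain ndarray of strings, not a chararray

# Build the boolean mask with an ordinary Python loop over the column.
mask = np.array([x.startswith("NGC") for x in c])

subset = table[np.where(mask)]   # form used in the text above
subset_direct = table[mask]      # equivalent boolean-mask indexing
```

Both subsets select the same rows; the np.where call only converts the 
mask to integer indices first.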

Even better, though, I have some experimental code that performs the loop 
in C, and I get a 5x speedup on a table with ~120,000 rows.  If that were 
to be included in numpy, it would be a strong argument against 
recommending list comprehensions in user code.  The use case suggests the continued 
existence of vectorized string operations in numpy -- whether that 
continues to be chararray, or some newer/better interface + chararray 
for backward compatibility, is an open question.  Personally I think a 
less object-oriented approach and just having a namespace full of 
vectorized string functions might be cleaner than the current situation 
of needing to create a view class around an ndarray.  I'm suggesting 
something like the following, using the same example, where {STR} is 
some namespace we would fill with vectorized string operations:

   subset = array[np.where(np.{STR}.startswith(c, 'NGC'))]
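
As a rough sketch of that calling convention only: the function below is 
a hypothetical pure-Python stand-in for what np.{STR}.startswith might 
look like.  Its name and signature are assumptions, and it still loops in 
Python, so it shows the shape of the interface rather than the speed of 
a C implementation:

```python
import numpy as np

def startswith(a, prefix):
    """Hypothetical stand-in for a namespaced, vectorized startswith.

    Takes any array-like of strings and returns a boolean ndarray of
    the same shape.  The loop runs in Python, not in C.
    """
    a = np.asarray(a)
    flat = a.ravel().tolist()  # plain Python strings
    out = np.fromiter(
        (s.startswith(prefix) for s in flat), dtype=bool, count=len(flat)
    )
    return out.reshape(a.shape)

# Illustrative data only.
c = np.array(["NGC 1300", "M31", "NGC 7331", "IC 342"])
subset = c[np.where(startswith(c, "NGC"))]
```

The point of the free-function form is that c stays an ordinary ndarray; 
no view class has to be created around it first.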

Now on to chararray as it now stands.  I view chararray as really two 
separable pieces of functionality:

   1) Convenience to perform vectorized string operations using 
'.method' syntax, or in some cases infix operators (+, *)
   2) Implicit "rstrip"ping of values

(Note that raw ndarrays truncate values at the first NUL character, 
like C strings, but chararrays will strip any and all whitespace 
characters from the end.)

Changing (2) seems likely to be a source of subtle bugs.  
Unfortunately, there's an inconsistency between (1) and (2) in the present 
implementation: indexing strips trailing whitespace, but the string 
methods operate on the unstripped values.  For example:

In [9]: a = np.char.array(['a  '])

In [10]: a
Out[10]: chararray(['a'], dtype='|S3')

In [11]: a[0] == 'a'
Out[11]: True

In [12]: a.endswith('a')
Out[12]: array([False], dtype=bool)

This is *the* design wart of chararray, IMHO, and one that's difficult 
to fix without breaking compatibility.  It might be a worthwhile 
experiment to remove (2) and see how much we really break, but it would 
be impossible to know for sure.
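
To spell out the two behaviours without depending on chararray itself, 
here is a small sketch on an ordinary string ndarray using plain Python 
string methods.  The sample values are invented.  The first mask tests 
the stored values as they are; the second strips trailing whitespace 
before testing, which is what a method consistent with (2) would have 
to do:

```python
import numpy as np

# Illustrative values with trailing whitespace, as in the session above.
a = np.array(["a  ", "b", "ca "])

# Test the stored values as-is: trailing spaces defeat the match.
raw = np.array([s.endswith("a") for s in a])

# Strip trailing whitespace first, then test.
stripped = np.array([s.rstrip().endswith("a") for s in a])
```

Here raw is all False while stripped is True for the first and last 
elements, which is the gap between the two pieces of functionality.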

Now to address the concerns enumerated in this thread.  Unfortunately, I 
don't know where this thread began before it landed on the Numpy list, 
so I may be missing details which would help me address them.

> 0) "it gets very little use" (an assumption you presumably dispute);
>   
Certainly not true from where I stand.
> 1) "is pretty much undocumented" (less true than a week ago, but still true for several of the attributes, with another handful or so falling into the category of "poorly documented");
>   
I don't quite understand this one -- 99% of the methods are wrappers 
around standard Python string methods.  I don't think we should 
redocument those.  I agree it needs a better top level docstring about 
its purpose (see functionalities (1) and (2) above) and its status (for 
backward compatibility).
> 2) "probably more buggy than most other parts of NumPy" ("probably" being a euphemism, IMO);
>   
Trac has these bugs.  Any others?

http://projects.scipy.org/numpy/ticket/1199
http://projects.scipy.org/numpy/ticket/1200
http://projects.scipy.org/numpy/ticket/856
http://projects.scipy.org/numpy/ticket/855
http://projects.scipy.org/numpy/ticket/1231
> 3) "there is not a really good use-case for it" (a conjecture, but one that has yet to be challenged by counter-example); 
>   
See above.
> 4) it's not the first time its presence in NumPy has been questioned ("as Stefan pointed out when asking this same question last year")
>   
Hopefully we're addressing that now.
> 5) NumPy already has a (perhaps superior) alternative ("object arrays would do nicely if one needs this functionality");
>   
No -- that gives the problem of even slower Python-looping to do 
vectorized string operations.
> to which I'll add:
>
> 6) it is, on its face, "counter to the spirit" of NumPy.
>   
I don't quite know what this means -- but I do find the fact that it's a 
view class with methods a little bit clumsy.  Is that what you meant?

So here's my TODO list related to all this:

1) Fix bugs in Trac
2) Improve documentation (though probably not in a method-by-method way)
3) Improve unit test coverage
4a) Create C-based vectorized string operations
4b) Refactor chararray in terms of those
4c) Design and create an interface to those methods that will be the 
"right way" going forward

Anything else?

Mike

-- 
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA