[Numpy-discussion] Massive differences in numpy vs. numeric string handling

Wed Apr 12 15:32:04 EDT 2006

Travis Oliphant wrote:

> Jeremy Gore wrote:
>
>> In Numeric:
>>
>> Numeric.array('test') -> array([t, e, s, t],'c'); shape = (4,)
>> Numeric.array(['test','two']) ->
>> array([[t, e, s, t],
>>        [t, w, o,  ]],'c')
>>
>> but in numpy:
>>
>> numpy.array('test') -> array('test', dtype='|S4'); shape = ()
>> numpy.array('test','S1') -> array('t', dtype='|S1'); shape = ()
>>
>> in fact you have to do an extra list cast:
>>
>> numpy.array(list('test'),'S1') -> array([t, e, s, t], dtype='|S1');  
>> shape = (4,)
>>
>> to get the desired result.  I don't think this is very pythonic, as  
>> strings are fully indexable and iterable objects.
>
>
>
> Let's not cast this discussion in Pythonic vs. un-pythonic because 
> that does not really shed light on the issues.
>
> NumPy adds full support for string arrays.   Numeric had this 
> step-child called a character array which was really just an array of 
> bytes that printed differently. 
> This does raise some compatibility issues that have been hard to get 
> exactly right, and convertcode indeed does not really solve the 
> problem for a heavy character-array user.    I have resisted simply 
> adding a 1-character string data-type back into NumPy,  but that 
> could be done if it is really necessary.  But, I don't think it is.
>
>>   Furthermore,  converting/treating a string as an array of 
>> characters is a very  common thing.  convertcode.py would not appear 
>> to convert this part  of the code correctly either.  Also, the use of 
>> quotes in the shape  () array but not in the shape (4,) array is 
>> inconsistent.
>
>
>>
>>
>> I realize the ability to use strings of arbitrary length as array  
>> elements is important in numpy, but there really should be a more  
>> natural option to convert/cast strings as character arrays.
>
>
> Perhaps all that is needed to simplify handling is to handle the 'S1' 
> case better so that
>
> array('test','S1')  works the same as array('test','c') used to work 
> (i.e. not stopping at strings for the sequence decomposition).

It seems a little wacky that 'S2' and 'S1' would have vastly different 
behaviour.
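
For comparison, a character decomposition can also be had explicitly, 
without special-casing 'S1' inside array(). The following is only a 
sketch, assuming a NumPy that provides frombuffer and assuming the 
.view method accepts a narrower string dtype on a contiguous array; the 
variable names are made up for illustration.

```python
import numpy as np

# One string -> 1-D array of single characters, shape (4,).
# frombuffer gives a read-only view of the bytes, so copy to get a writable array.
chars = np.frombuffer(b"test", dtype="S1").copy()

# List of strings -> 2-D character grid, like Numeric's 'c' arrays.
# Shorter strings are padded with null bytes, which show up as empty b''.
words = np.array([b"test", b"two"], dtype="S4")
grid = words.view("S1").reshape(len(words), words.dtype.itemsize)

print(chars)       # four one-character elements
print(grid.shape)  # (2, 4)
```

Either spelling keeps the decomposition explicit at the call site, so 
'S1' and 'S2' need not behave differently inside array() itself.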

>>
>> Also, unlike Numeric.equal and 'c' arrays, numpy.equal cannot 
>> compare  '|S1' arrays or presumably other strings for equality, 
>> although this  is a very useful comparison to make.
>
>
> This is a known missing feature due to the fact that comparisons use 
> ufuncs but ufuncs are not supported for variable-length arrays.   
> Currently, however, you can use the chararray class which does allow 
> comparisons of strings.

It seems like this should be easy to work around in __cmp__ (or 
array_compare or however it's spelled). Since the strings really have a 
fixed length, they're more or less equivalent to byte arrays with one 
extra dimension. Run a comparison ufunc over those byte arrays, then 
write a little lexicographic comparison on top of the result; that 
should be a piece of cake, in theory at least.
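
A rough sketch of that idea in Python, not a proposal for the actual 
NumPy internals. The helper names bytes_view and string_less are 
hypothetical, and the sketch assumes both inputs are fixed-width 'S' 
arrays of the same shape and that take_along_axis is available.

```python
import numpy as np

def bytes_view(a, width):
    """View an 'S<width>' array as uint8 with one extra trailing axis."""
    # Cast to a common width (shorter strings get null padding) and make contiguous.
    a = np.ascontiguousarray(a, dtype=f"S{width}")
    return a.view(np.uint8).reshape(a.shape + (width,))

def string_less(a, b):
    """Elementwise lexicographic a < b for two same-shape string arrays."""
    a, b = np.asarray(a), np.asarray(b)
    width = max(a.dtype.itemsize, b.dtype.itemsize)
    x, y = bytes_view(a, width), bytes_view(b, width)

    differs = x != y                   # plain ufunc comparison on the bytes
    any_diff = differs.any(axis=-1)    # False where the strings are equal
    first = differs.argmax(axis=-1)    # position of the first differing byte

    # Pick out the byte at that position from each operand.
    idx = first[..., None]
    xb = np.take_along_axis(x, idx, axis=-1)[..., 0]
    yb = np.take_along_axis(y, idx, axis=-1)[..., 0]

    # The first differing byte decides the order; equal strings are not "less".
    return any_diff & (xb < yb)

a = np.array([b"abc", b"abd", b"b", b"same", b"ab"])
b = np.array([b"abd", b"abc", b"ab", b"same", b"abc"])
print(string_less(a, b))
```

The other comparisons follow the same pattern: equality is just 
"no byte differs", and greater-than swaps the final test.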

>
> There are simple ways to work around this, of course.   If you do have 
> 'S1' arrays, then you can simply view them as unsigned bytes (using 
> the .view method) and do the comparison that way.
> If s1 and s2 are "character arrays":
>
> s1.view(ubyte) >= s2.view(ubyte)

Nice!
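
Spelled out as a small example (a sketch only; the sample strings are 
made up, and it assumes the arrays really are 'S1', so that each element 
maps to exactly one byte):

```python
import numpy as np

# Two "character arrays": one byte per element.
s1 = np.array(list("test"), dtype="S1")
s2 = np.array(list("tent"), dtype="S1")

# Reinterpret the same memory as unsigned bytes; no data is copied.
u1 = s1.view(np.ubyte)
u2 = s2.view(np.ubyte)

ge = u1 >= u2   # elementwise ordering by byte value
eq = u1 == u2   # elementwise equality

print(ge)
print(eq)
```

This only works character by character; for longer fixed-width strings 
the bytes of one element would have to be combined before comparing.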

Regards,

-tim