[Numpy-discussion] Massive differences in numpy vs. numeric string handling
Tim Hochberg
tim.hochberg at cox.net
Wed Apr 12 15:32:04 EDT 2006
Travis Oliphant wrote:
> Jeremy Gore wrote:
>
>> In Numeric:
>>
>> Numeric.array('test') -> array([t, e, s, t],'c'); shape = (4,)
>> Numeric.array(['test','two']) ->
>> array([[t, e, s, t],
>> [t, w, o, ]],'c')
>>
>> but in numpy:
>>
>> numpy.array('test') -> array('test', dtype='|S4'); shape = ()
>> numpy.array('test','S1') -> array('t', dtype='|S1'); shape = ()
>>
>> in fact you have to do an extra list cast:
>>
>> numpy.array(list('test'),'S1') -> array([t, e, s, t], dtype='|S1');
>> shape = (4,)
>>
>> to get the desired result. I don't think this is very pythonic, as
>> strings are fully indexable and iterable objects.
>
>
>
> Let's not cast this discussion in Pythonic vs. un-pythonic because
> that does not really shed light on the issues.
>
> NumPy adds full support for string arrays. Numeric had this
> step-child called a character array which was really just an array of
> bytes that printed differently.
> This does raise some compatibility issues that have been hard to get
> exactly right, and convertcode indeed does not really solve the
> problem for a heavy character-array user. I have resisted simply
> adding a 1-character string data-type back into NumPy, but that
> could be done if it is really necessary. But, I don't think it is.
>
>> Furthermore, converting/treating a string as an array of
>> characters is a very common thing. convertcode.py would not appear
>> to convert this part of the code correctly either. Also, the use of
>> quotes in the shape () array but not in the shape (4,) array is
>> inconsistent.
>
>
>>
>>
>> I realize the ability to use strings of arbitrary length as array
>> elements is important in numpy, but there really should be a more
>> natural option to convert/cast strings as character arrays.
>
>
> Perhaps all that is needed to simplify handling is to handle the 'S1'
> case better so that
>
> array('test','S1') works the same as array('test','c') used to work
> (i.e. not stopping at strings for the sequence decomposition).
It seems a little wacky that 'S2' and 'S1' would have vastly different
behaviour.
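
For concreteness, here is a small sketch of the difference under discussion, written against a present-day NumPy rather than the 2006 code base. The helper name to_char_array is hypothetical, and the sketch assumes ASCII-only input; it only illustrates that a bare string is taken as one element while an explicit split yields a shape (4,) array.

```python
import numpy as np

def to_char_array(s):
    """Split an ASCII string into a 1-D array of one-byte strings.

    Hypothetical helper, not part of NumPy. Assumes `s` is pure ASCII.
    """
    return np.frombuffer(s.encode("ascii"), dtype="S1")

# A bare string is treated as a single array element, so the result is 0-d.
scalar = np.array("test", dtype="S1")

# Splitting the string first gives one element per character.
chars = to_char_array("test")
chars_from_list = np.array(list("test"), dtype="S1")

print(scalar.shape, chars.shape)
```

Both spellings of the split produce the same 'S1' array; np.frombuffer avoids building an intermediate Python list but returns a read-only view of the encoded bytes.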
>>
>> Also, unlike Numeric.equal and 'c' arrays, numpy.equal cannot
>> compare '|S1' arrays or presumably other strings for equality,
>> although this is a very useful comparison to make.
>
>
> This is a known missing feature due to the fact that comparisons use
> ufuncs but ufuncs are not supported for variable-length arrays.
> Currently, however, you can use the chararray class which does allow
> comparisons of strings.
It seems like this should be easy to work around in __cmp__ (or
array_compare or however it's spelled). Since the strings really have a
fixed length, they're more or less equivalent to byte arrays with one
extra dimension. Writing a little lexicographic comparison on top of
a ufunc comparison of these byte arrays should be a piece of cake; in
theory at least.
>
> There are simple ways to work around this, of course. If you do have
> 'S1' arrays, then you can simply view them as unsigned bytes (using
> the .view method) and do the comparison that way.
> If s1 and s2 are "character arrays":
>
> s1.view(ubyte) >= s2.view(ubyte)
Nice!
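
Spelled out as a runnable snippet against a present-day NumPy (the sample strings are made up for illustration), the view trick looks roughly like this:

```python
import numpy as np

# Two 'S1' "character arrays" with made-up contents.
s1 = np.array(list("test"), dtype="S1")
s2 = np.array(list("zest"), dtype="S1")

# Reinterpret each one-byte string as an unsigned byte; no data is copied.
ge = s1.view(np.ubyte) >= s2.view(np.ubyte)
eq = s1.view(np.ubyte) == s2.view(np.ubyte)

print(ge)
print(eq)
```

The view only works this directly because each element is exactly one byte wide; for wider strings the bytes of one element would spread across several array entries. More recent NumPy releases also accept s1 == s2 on string arrays without the view.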
Regards,
-tim