[Numpy-discussion] numpy.array() of mixed integers and strings can truncate data

Benjamin Root ben.root at ou.edu
Thu Dec 1 10:29:39 EST 2011


On Thursday, December 1, 2011, Thouis Jones <thouis.jones at curie.fr> wrote:
> On Thu, Dec 1, 2011 at 15:47, Pierre Haessig <pierre.haessig at crans.org> wrote:
>> On 01/12/2011 14:52, Thouis (Ray) Jones wrote:
>>> Is this expected behavior?
>>>
>>> >>> np.array([-345,4,2,'ABC'])
>>> array(['-34', '4', '2', 'ABC'], dtype='|S3')
>>>
>>>
>> With my numpy 1.5.1, I indeed got a different result:
>>
>> In [1]: np.array([-345,4,2,'ABC'])
>> Out[1]:
>> array(['-345', '4', '2', 'ABC'],
>>      dtype='|S8')
>
> This is closer to what I would expect.
>
>> The type casting is a bit different, and may actually better match what
>> you expect, but a cast is still required
>> (i.e. you cannot have a "numpy.array() of mixed integers and strings"
>> because numpy arrays only store *homogeneous* sets of data)
>
> Of course, but when converting from a non-homogeneous Python list, I
> would expect it to do something reasonable (or at least not as bad as
> turning -345 into '-34').
>
>> Now one question remains for me : why use a numpy array to store a few
>> strings, and not just a regular Python list ?
>
> It was a small test case.  The actual data is much larger.
>
> Ray Jones
>

This is total speculation on my part.  My suspicion is that the loading
process sees numbers first and starts casting them as numbers; then it sees
the string and realizes that it has to cast everything to a fixed-width
string.  The width is taken from the longest string it has seen.  Since -345
was already processed as a number, the length of its string representation
is never considered.
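
To make that guess concrete, here is a toy model in plain Python.  It is
not numpy's code; fixed_width_cast and its count_numbers flag are names
made up for this sketch.  It only shows that measuring the width from the
string items alone would give exactly the '-34' reported above.

```python
def fixed_width_cast(items, count_numbers):
    """Toy model of casting a mixed list to fixed-width strings.

    Hypothetical helper: this is NOT numpy's implementation, it only
    mimics the mechanism guessed at in this message.
    """
    if count_numbers:
        # Width taken from the text form of every item.
        width = max((len(str(x)) for x in items), default=0)
    else:
        # Guessed bug: only items that are already strings set the width.
        width = max((len(x) for x in items if isinstance(x, str)), default=0)
    # Every item is converted to text and cut to the chosen width.
    return [str(x)[:width] for x in items]


data = [-345, 4, 2, 'ABC']
truncated = fixed_width_cast(data, count_numbers=False)
full = fixed_width_cast(data, count_numbers=True)
print(truncated)  # ['-34', '4', '2', 'ABC']
print(full)       # ['-345', '4', '2', 'ABC']
```

If the real code behaves like the count_numbers=False branch, any number
whose text is longer than the longest string in the list would be cut off.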

Does the same problem occur if -345 comes after "ABC"?
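
A quick way to check is sketched below, assuming numpy is importable as np.
The results of the first two calls depend on the numpy version, so no
expected output is shown for them.  The last two lines are possible
workarounds that do not rely on numpy guessing the string width: convert to
text yourself, or keep the original objects with dtype=object.

```python
import numpy as np

data = [-345, 4, 2, 'ABC']

# Original order, then the same values with the string moved to the front.
print(repr(np.array(data)))
print(repr(np.array(['ABC', -345, 4, 2])))

# Workaround 1: convert to str first, so the width is measured from the text.
as_text = np.array([str(x) for x in data])

# Workaround 2: an object array keeps the original Python objects unchanged.
as_objects = np.array(data, dtype=object)

print(repr(as_text))
print(repr(as_objects))
```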

Ben Root