[Numpy-discussion] numpy.array() of mixed integers and strings can truncate data

Charles R Harris charlesr.harris at gmail.com
Fri Dec 2 12:53:44 EST 2011


On Fri, Dec 2, 2011 at 8:23 AM, Thouis (Ray) Jones <thouis at gmail.com> wrote:

> On Thu, Dec 1, 2011 at 17:39, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> > Given that strings should be the result, this looks like a bug. It's a bit
> > of a corner case that probably slipped through during the recent work on
> > casting. There need to be tests for these sorts of things, so if you find
> > more oddities, post them so we can add them.
>
> I'm happy to add a patch and tests, but could use some guidance...
>
> It looks like discover_itemsize() in core/src/multiarray/ctors.c
> should compute the length of the string or unicode representation of
> the object based on the eventual type, but looking at
> UNICODE_setitem() and STRING_setitem() in
> core/src/multiarray/arraytypes.c.src, this is not trivial.
>
> Perhaps the object-to-unicode/string parts of
> UNICODE_setitem/STRING_setitem can be extracted into separate
> functions that can be called from *_setitem as well as
> discover_itemsize. discover_itemsize would also need to know the
> type it's discovering for (string or unicode or user-defined).
>
>
After sleeping on this, I think an object array would be the better choice
in this situation, since it wouldn't lose information. This might change the
behavior of some functions though, so it would need testing.
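
To make the trade-off concrete, here is a rough pure-Python model of the options on the table. It is only a sketch: the function names are hypothetical, and none of this is NumPy's actual C code in ctors.c. It assumes the truncation comes from sizing the string field only from items that have a length, and compares that with sizing from each item's string form and with simply keeping the original objects.

```python
# Hypothetical model, not NumPy's implementation.

def discover_itemsize_by_len(items):
    """Width taken only from items that have a length (str/bytes)."""
    return max((len(x) for x in items if isinstance(x, (str, bytes))), default=0)


def discover_itemsize_by_repr(items):
    """Width taken from the string form each item would be stored as."""
    return max((len(str(x)) for x in items), default=0)


def store_fixed_width(items, width):
    """Store every item as a string cut to `width` characters."""
    return [str(x)[:width] for x in items]


data = [1234567, "ab"]

# Sizing from lengths only: the integer is cut down to the string width.
narrow = store_fixed_width(data, discover_itemsize_by_len(data))

# Sizing from the string form: nothing is lost, but everything is a string.
wide = store_fixed_width(data, discover_itemsize_by_repr(data))

# Object-style storage: the original values are kept as they are.
as_objects = list(data)
```

In this model only the first variant loses data; the object variant also keeps the integer as an integer rather than converting it to text.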

> Not sure what to do to handle user-defined types (error?).
>
> If that's too complicated, maybe discover_itemsize should return -1
> (or warn, but given the danger of truncation, that seems a bit weak)
> if asked to discover from data that doesn't have a length.  This would
> result in dtype=object when np.array is handed a mixed int/string
> list.
>
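
The sentinel idea above could look roughly like the following in Python. This is a sketch under assumptions: the real discover_itemsize() is C code, and choose_kind() is a hypothetical helper standing in for whatever the caller would do with the -1 result.

```python
# Hypothetical model of the "-1 means no length" proposal.

def discover_itemsize(items):
    """Return the longest item length, or -1 if any item has no length."""
    size = 0
    for x in items:
        if not isinstance(x, (str, bytes)):
            return -1  # no meaningful length, signal the caller
        size = max(size, len(x))
    return size


def choose_kind(items):
    """Pick 'object' when the size cannot be discovered, else 'string'."""
    return "object" if discover_itemsize(items) < 0 else "string"
```

With this rule a mixed list such as [1, "ab"] would select the object kind, while a list of strings only would still get a fixed-width string kind.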
> I wonder, also, if STRING_setitem and UNICODE_setitem shouldn't emit a
> warning if asked to truncate data?
>
>
I think a warning would be useful. But I don't use strings much so input
from a user might carry more weight.
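
For illustration, a warning on truncation might behave like this minimal Python sketch. The function name and the choice of RuntimeWarning are assumptions made for the example; STRING_setitem and UNICODE_setitem are C functions and nothing here reflects their actual code.

```python
import warnings


def setitem_fixed_width(value, width):
    """Convert value to text, warning if it must be cut to fit width."""
    text = str(value)
    if len(text) > width:
        warnings.warn(
            f"truncating {text!r} to {width} characters",
            RuntimeWarning,
            stacklevel=2,
        )
    return text[:width]
```

A caller who prefers a hard failure could turn the warning into an exception with the standard warnings filters.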

Chuck