[Numpy-discussion] numpy.array() of mixed integers and strings can truncate data

Fri Dec 2 10:23:46 EST 2011

On Thu, Dec 1, 2011 at 17:39, Charles R Harris
<charlesr.harris at gmail.com> wrote:
> Given that strings should be the result, this looks like a bug. It's a bit
> of a corner case that probably slipped through during the recent work on
> casting. There needs to be tests for these sorts of things, so if you find
> more oddities post them so we can add them.

I'm happy to add a patch and tests, but could use some guidance...

It looks like discover_itemsize() in core/src/multiarray/ctors.c
should compute the length of the string or unicode representation of
the object based on the eventual type, but looking at
UNICODE_setitem() and STRING_setitem() in
core/src/multiarray/arraytypes.c.src, this is not trivial.

Perhaps the object-to-unicode/string parts of
UNICODE_setitem/STRING_setitem can be extracted into separate
functions that can be called from *_setitem as well as
discover_itemsize.  discover_itemsize would also need to know the
type it's discovering for (string or unicode or user-defined).
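
To make the refactoring concrete, here is a rough Python sketch of the
shape I have in mind.  It is not the real C code: to_text, the 'U'/'S'
kind codes and the character-count itemsize are all made-up names and
simplifications for illustration only.

```python
def to_text(obj, kind):
    """Hypothetical shared helper: convert obj to its unicode ('U') or
    byte-string ('S') representation.  Stands in for the conversion logic
    that currently lives inside UNICODE_setitem/STRING_setitem."""
    if kind == "U":
        if isinstance(obj, bytes):
            return obj.decode("ascii")
        return obj if isinstance(obj, str) else str(obj)
    if kind == "S":
        if isinstance(obj, bytes):
            return obj
        return str(obj).encode("ascii")
    # User-defined kinds are not handled in this sketch.
    raise TypeError("unsupported kind: %r" % (kind,))


def discover_itemsize(items, kind):
    """Size discovery uses the same conversion that setitem will use,
    so the two can never disagree about the required length."""
    return max((len(to_text(x, kind)) for x in items), default=1)


def setitem(obj, kind, itemsize):
    """Store obj, cut to itemsize characters (assumed semantics)."""
    return to_text(obj, kind)[:itemsize]
```

With this arrangement, discovering the size for [12345678, "abc"] as
unicode gives 8, so storing 12345678 afterwards loses nothing.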

Not sure what to do to handle user-defined types (error?).

If that is too complicated, maybe discover_itemsize should return -1
(or warn, but given the danger of truncation, that seems a bit weak)
if asked to discover from data that doesn't have a length.  This would
result in dtype=object when np.array is handed a mixed int/string
list.
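
A sketch of that fallback in Python, again only to illustrate the idea
and not how ctors.c is actually structured.  discover_itemsize_or_fail
and choose_dtype are hypothetical names, and the returned tuples are
just placeholders for real dtypes.

```python
def discover_itemsize_or_fail(items):
    """Return the longest string length, or -1 if any item is not
    string-like and so has no meaningful length (assumed convention)."""
    size = 0
    for x in items:
        if not isinstance(x, (str, bytes)):
            return -1
        size = max(size, len(x))
    return size


def choose_dtype(items):
    """Fall back to an object array when the itemsize cannot be discovered."""
    size = discover_itemsize_or_fail(items)
    if size < 0:
        return ("object", None)
    return ("string", size)
```

Here a list of strings still gets a fixed-width string type, while a
mixed int/string list ends up as object and nothing is truncated.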

I wonder, also, if STRING_setitem and UNICODE_setitem shouldn't emit a
warning if asked to truncate data?
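
Something along these lines, sketched in Python with the standard
warnings module.  string_setitem_checked is a hypothetical stand-in for
the C functions, and RuntimeWarning is only my guess at a suitable
category.

```python
import warnings


def string_setitem_checked(obj, itemsize):
    """Store obj as text in a field of itemsize characters, warning
    when the value does not fit and has to be cut."""
    text = obj if isinstance(obj, str) else str(obj)
    if len(text) > itemsize:
        warnings.warn(
            "value %r truncated to %d characters" % (text, itemsize),
            RuntimeWarning,
            stacklevel=2,
        )
    return text[:itemsize]
```

Callers who really want truncation could silence the warning with the
usual warnings filters; everyone else would at least find out.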

Ray Jones