numpy 00 character bug?

Nathaniel Rook nrook at wesleyan.edu
Fri Jun 5 12:14:10 EDT 2009


Hello, all!

I've recently encountered what looks like a bug in NumPy's string 
arrays: the ASCII NUL character ('\x00') is dropped when it appears at 
the end of a string.

For example:

Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import numpy
 >>> print numpy.version.version
1.3.0
 >>> arr = numpy.empty(1, 'S2')
 >>> arr[0] = 'ab'
 >>> arr
array(['ab'],
       dtype='|S2')
 >>> arr[0] = 'c\x00'
 >>> arr
array(['c'],
       dtype='|S2')

It seems that the string array uses the NUL character to pad strings 
shorter than the maximum size, and therefore treats any NUL characters 
at the end of a string as padding.  As long as every string I store is 
exactly the maximum size, no information is actually lost, since I can 
pad the result back out myself, but I don't want to have to re-add my 
NULs each time I ask the array what it is holding.
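To make the padding behaviour concrete, here is a small sketch written for a current Python and NumPy rather than the 2.5 / 1.3.0 session above. It assumes `tobytes()` is available (older NumPy releases spell it `tostring()`); the variable names are only illustrative. It suggests that the NUL is still present in the array's buffer and is only stripped when an element is read back, and that a NUL that is not at the end survives.

```python
import numpy

arr = numpy.empty(1, 'S2')
arr[0] = b'c\x00'

# Reading the element back strips the trailing NUL.
item = arr[0]

# The underlying buffer still holds both bytes.
raw = arr.tobytes()

# A NUL that is not at the end of the string is kept.
arr[0] = b'\x00c'
middle = arr[0]

print(repr(item))
print(repr(raw))
print(repr(middle))
```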

Is this a well-known bug already?  I couldn't find it on the NumPy bug 
tracker, but I could easily have missed it, or it may have been triaged 
and deemed acceptable because there is no better way to deal with 
arbitrary-length strings.  Is there an easy way to avoid this problem? 
Pretty much every performance-intensive part of my program deals with 
these arrays, so I don't want to replace them with a slower dictionary 
instead.
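One possible way around this, sketched here under the assumption that every record has the same fixed width, is to keep the raw bytes in a 2-D `uint8` array instead of an 'S' array, so that nothing is ever interpreted as padding. The helper names `pack` and `unpack` and the constant `WIDTH` are made up for this illustration, and the code targets a current Python and NumPy (`tobytes()` is `tostring()` on older releases).

```python
import numpy

WIDTH = 2  # assumed fixed record size in bytes, as in the 'S2' example


def pack(records):
    """Store equal-length byte strings as rows of a 2-D uint8 array."""
    flat = numpy.frombuffer(b''.join(records), dtype=numpy.uint8)
    # frombuffer gives a read-only view of the joined bytes;
    # copy() makes the table writable.
    return flat.reshape(len(records), WIDTH).copy()


def unpack(table, i):
    """Return row i as a byte string, trailing NULs included."""
    return table[i].tobytes()


table = pack([b'ab', b'c\x00'])
second = unpack(table, 1)
print(repr(second))
```

The trade-off is that the rows are plain integers, so string comparisons and printing need an explicit `unpack` step, but the data stays in one contiguous NumPy buffer.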

I can't imagine this issue hasn't come up before; I encountered it by 
using NumPy arrays to store Python structs, something I can imagine is 
done fairly often.  As such, I apologize for bringing it up again!

Nathaniel


