[Numpy-discussion] `missing` argument in genfromtxt only a string?

Tue Sep 15 10:44:16 EDT 2009

On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<pgmdevlist at gmail.com>  wrote:
>>
> [snip]
>>> OK, I see the problem...
>>> When no dtype is defined, we try to guess what a converter should
>>> return by testing its inputs. At first we check whether the input is a
>>> boolean, then whether it's an integer, then a float, and so on. When
>>> you define explicitly a converter, there's no need for all those
>>> checks, so we lock the converter to a particular state, which sets the
>>> conversion function and the value to return in case of missing.
>>> Except that I messed it up and it fails in that case (the conversion
>>> function is set properly, but the dtype of the output is still
>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>> kitten.
>>>
>> No worries.  I really like genfromtxt (having recently gotten pretty
>> familiar with it) and would like to help out with extending it towards
>> these kinds of cases if there's interest and it is feasible.
>>
>> I tried another workaround for the dates with my converters defined as conv
>>
>> conv.update({date : lambda s : datetime(*map(int,
>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>
>> Where `date` is the column that contains a date.  The problem is that
>> my dates are "mm/dd/yyyy" while datetime needs (yyyy, mm, dd).  A test
>> case with "dd/mm/yyyy" dates worked when I just used reversed, but the
>> lambda above gave an error about not finding the day in the third
>> position, even though it worked for a test case outside of
>> genfromtxt.
>>
>>
>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>
> In SAS there are multiple ways to define formats especially dates:
> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>
> It would be nice to accept the common variants (USA vs English dates) as
> well as two digit vs 4 digit year codes.
>

This is relevant to what I've been doing.  I parsed a SAS input file
to get the information to pass to genfromtxt, and it might be useful
to have these types defined.  Again, I'm wondering whether the
new datetime dtype might eventually be used for something like this.
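
For the "mm/dd/yyyy" dates discussed above, one alternative to splitting
and reordering the fields by hand is datetime.strptime, which also makes
it easy to accept two-digit years.  This is only a minimal sketch:
`parse_date` and `DATE_FORMATS` are hypothetical names, only the US
month-first order is handled, and the bytes branch is an assumption in
case the converter is handed bytes rather than str.

```python
from datetime import datetime

# Hypothetical list of accepted layouts: 4-digit year first, then 2-digit.
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y")

def parse_date(field):
    """Convert a US-style date field such as '09/15/2009' to a datetime."""
    if isinstance(field, bytes):
        # Assumption: the caller may pass raw bytes instead of str.
        field = field.decode("ascii")
    field = field.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(field, fmt)
        except ValueError:
            continue  # try the next layout
    raise ValueError("unrecognized date field: %r" % field)

print(parse_date(" 09/15/2009 "))
print(parse_date("09/15/09"))
```

It could be passed as `converters={date: parse_date}` in place of the
lambda.  Whether that sidesteps the locked-converter bug Pierre
describes is untested here.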

Do you know if SAS publishes the format of its datasets, similar to
Stata?  http://www.stata.com/help.cgi?dta

>
>
>>> or even
>>> simpler, define a dtype for the output (you know that your first
>>> column is a str, your second an object, and the others ints or floats...
>>>
>>>
> How do you specify different dtypes in genfromtxt?
> I could not see the information in the docstring and the dtype argument
> does not appear to allow multiple dtypes.
>

I have also been struggling with this (and with modifying the dtype of
a field in a structured array in place, btw).  To give a quick example,
here are some of the ways that I expected to work but didn't, and a
few ways that do work.

from StringIO import StringIO
import numpy as np

# a few incorrect ones

s = StringIO("11.3abcde")
data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])

In [42]: data
Out[42]: array([ 1,  1, -1])

s.seek(0)
data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])

In [45]: data
Out[45]: array([ 1. ,  1.3,  NaN])

s.seek(0)
data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])

In [48]: data
Out[48]:
array(['1', '1.3', 'abcde'],
      dtype='|S5')

# a few correct ones

s.seek(0)
data = np.genfromtxt(s,
    dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
    delimiter=[1,3,5])

In [52]: data
Out[52]:
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

s.seek(0)
data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])

In [55]: data
Out[55]:
array((1, 1.3, 'abcde'),
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])

# one I expected to work but have probably made an obvious mistake

s.seek(0)
data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
    names=['myint','myfloat','mystring'], delimiter=[1,3,5])

In [64]: data
Out[64]: array([ 1,  1, -1])

# "ugly" way to do this, but it works

s.seek(0)
data = np.genfromtxt(s,
    dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
    names=['myint','myfloat','mystring'], delimiter=[1,3,5])

In [69]: data
Out[69]:
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
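
A likely explanation for the failing cases, offered as a reading of the
np.dtype signature rather than a confirmed diagnosis: np.dtype takes a
single type specification, and further positional arguments are taken
as its `align` and `copy` options.  So np.dtype(int, float, str) and
np.dtype('i8','f8','a5') would each describe only their first type,
which matches the single-type outputs above.  The sketch below shows two
ways to put several types into one specification.  It assumes Python 3
and a recent NumPy (neither is what the session above used), so it
imports StringIO from io and uses 'U5' instead of 'a5'.

```python
from io import StringIO  # Python 3 location; the session above used Python 2
import numpy as np

# A comma-separated format string gives one field per entry,
# with default field names.
spec = np.dtype("i8,f8,U5")
print(spec.names)

# A list of (name, format) tuples names the fields directly.
s = StringIO("11.3abcde")
data = np.genfromtxt(
    s,
    dtype=[("myint", "i8"), ("myfloat", "f8"), ("mystring", "U5")],
    delimiter=[1, 3, 5],  # fixed-width fields of 1, 3 and 5 characters
)
print(data["myint"], data["myfloat"], data["mystring"])
```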

Skipper