[Numpy-discussion] `missing` argument in genfromtxt only a string?

Tue Sep 15 10:57:36 EDT 2009

On Tue, Sep 15, 2009 at 10:44 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Tue, Sep 15, 2009 at 9:43 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>> On 09/14/2009 09:31 PM, Skipper Seabold wrote:
>>> On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM<pgmdevlist at gmail.com>  wrote:
>>>
>> [snip]
>>>> OK, I see the problem...
>>>> When no dtype is defined, we try to guess what a converter should
>>>> return by testing its inputs. At first we check whether the input is a
>>>> boolean, then whether it's an integer, then a float, and so on. When
>>>> you define explicitly a converter, there's no need for all those
>>>> checks, so we lock the converter to a particular state, which sets the
>>>> conversion function and the value to return in case of missing.
>>>> Except that I messed it up and it fails in that case (the conversion
>>>> function is set properly, but the dtype of the output is still
>>>> undefined). That's a bug, I'll try to fix that once I've tamed my snow
>>>> kitten.
>>>>
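A rough sketch of the "try bool, then int, then float, and so on" guessing described above. This is only an illustration of the idea in plain Python, not genfromtxt's actual converter code; `guess_converter` and `to_bool` are hypothetical names made up for the example.

```python
def to_bool(s):
    """Accept only the literal strings 'true' / 'false' (case-insensitive)."""
    lowered = s.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(s)


def guess_converter(values):
    """Return (name, func) for the first candidate that parses every value.

    Candidates are tried from most to least restrictive, so a column is
    only "upgraded" to a wider type when a narrower one fails.
    """
    candidates = [("bool", to_bool), ("int", int), ("float", float), ("str", str)]
    for name, func in candidates:
        try:
            for v in values:
                func(v)
        except ValueError:
            continue  # this candidate failed, try the next wider one
        return name, func


name, func = guess_converter(["1", "2.5"])
print(name)
```

Locking a converter, in this picture, would just mean skipping the loop and using one fixed `(name, func)` pair.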
>>> No worries.  I really like genfromtxt (having recently gotten pretty
>>> familiar with it) and would like to help out with extending it towards
>>> these kinds of cases if there's an interest and this is feasible.
>>>
>>> I tried another workaround for the dates with my converters defined as conv
>>>
>>> conv.update({date : lambda s : datetime(*map(int,
>>> s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})
>>>
>>> where `date` is the column that contains a date.  The problem was that
>>> my dates are "mm/dd/yyyy" and datetime needs (yyyy, mm, dd).  It worked
>>> for a test case where my dates were "dd/mm/yyyy" and I just used
>>> reversed, but here it gave an error about not finding the day in the
>>> third position, even though the lambda function worked for a test case
>>> outside of genfromtxt.
>>>
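A small standalone sketch of an "mm/dd/yyyy" converter, written two ways: with `datetime.strptime`, and by reordering the split pieces by hand as the lambda above tries to do. The function names are hypothetical. It is only exercised outside genfromtxt here; depending on the NumPy version, a converter may be handed bytes rather than str, in which case a decode step would be needed first.

```python
from datetime import datetime


def parse_us_date(s):
    """Parse 'mm/dd/yyyy' using a format string."""
    return datetime.strptime(s.strip(), "%m/%d/%Y")


def parse_us_date_by_hand(s):
    """Parse 'mm/dd/yyyy' by splitting and reordering to (year, month, day)."""
    month, day, year = (int(part) for part in s.strip().split("/"))
    return datetime(year, month, day)


print(parse_us_date("09/15/2009"))
print(parse_us_date_by_hand(" 09/15/2009 "))
```

Naming the three pieces explicitly avoids the slice arithmetic, which is easy to get wrong when the field order changes.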
>>>
>>>> Meanwhile, you can use tsfromtxt (in scikits.timeseries),
>>>>
>> In SAS there are multiple ways to define formats especially dates:
>> http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a002200738.htm
>>
>> It would be nice to accept the common variants (USA vs English dates) as
>> well as two digit vs 4 digit year codes.
>>
>
> This is relevant to what I've been doing.  I parsed a SAS input file
> to get the information to pass to genfromtxt, and it might be useful
> to have these types defined.  Again, I'm wondering about whether the
> new datetime dtype might eventually be used for something like this.
>
> Do you know if SAS publishes the format of its datasets, similar to
> Stata?  http://www.stata.com/help.cgi?dta
>
>>
>>
>>>> or even
>>>> simpler, define a dtype for the output (you know that your first
>>>> column is a str, your second an object, and the others ints or floats...)
>>>>
>>>>
>> How do you specify different dtypes in genfromtxt?
>> I could not see the information in the docstring and the dtype argument
>> does not appear to allow multiple dtypes.
>>
>
> I have also been struggling with this (and with modifying the dtype of a
> field in a structured array in place, btw).  To give a quick example,
> here are some ways that I expected to work but didn't, and a few
> ways that do work.
>
> from StringIO import StringIO
> import numpy as np
>
> # a few incorrect ones
>
> s = StringIO("11.3abcde")
> data = np.genfromtxt(s, dtype=np.dtype(int, float, str), delimiter=[1,3,5])
>
> In [42]: data
> Out[42]: array([ 1,  1, -1])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(float, int, str), delimiter=[1,3,5])
>
> In [45]: data
> Out[45]: array([ 1. ,  1.3,  NaN])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype(str, float, int), delimiter=[1,3,5])
>
> In [48]: data
> Out[48]:
> array(['1', '1.3', 'abcde'],
>      dtype='|S5')

These are not problems with genfromtxt; the dtype construction is not
doing what you think it is. I don't know what the second and third
arguments are, but only the first one ends up in the result:

>>> np.dtype(int,float,str)
dtype('int32')
>>> np.dtype(float,float,str)
dtype('float64')
>>> np.dtype(str,float,str)
dtype('|S0')

I think the versions below are the correct way of specifying a structured dtype.
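For reference, a minimal sketch of two spellings that I am fairly sure produce a structured dtype: a list of (name, format) tuples, or a single comma-separated format string, which gets default field names. As far as I can tell from the `np.dtype` signature, extra positional arguments are taken as `align` and `copy`, not as further fields. The field names are the ones from the quoted example; 'S5' is used in place of 'a5' on the assumption of a recent NumPy.

```python
import numpy as np

# One (name, format) tuple per field.
named = np.dtype([('myint', 'i8'), ('myfloat', 'f8'), ('mystring', 'S5')])

# Comma-separated formats in ONE string; fields are named automatically.
auto = np.dtype('i8,f8,S5')

print(named.names)   # ('myint', 'myfloat', 'mystring')
print(auto.names)    # ('f0', 'f1', 'f2')
print(named['myfloat'])
```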

Josef

>
> # a few correct ones
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('myint','i8'),('myfloat','f8'),('mystring','a5')]),
> delimiter=[1,3,5])
>
> In [52]: data
> Out[52]:
> array((1, 1.3, 'abcde'),
>      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=[1,3,5])
>
> In [55]: data
> Out[55]:
> array((1, 1.3, 'abcde'),
>      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '|S5')])
>
> # one I expected to work but have probably made an obvious mistake
>
> s.seek(0)
> data = np.genfromtxt(s, dtype=np.dtype('i8','f8','a5'),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [64]: data
> Out[64]: array([ 1,  1, -1])
>
> # "ugly" way to do this, but it works
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=np.dtype([('','i8'),('','f8'),('','a5')]),
> names=['myint','myfloat','mystring'], delimiter=[1,3,5])
>
> In [69]: data
> Out[69]:
> array((1, 1.3, 'abcde'),
>      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
>
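A hedged Python 3 version of the working fixed-width example above, as a sketch rather than a definitive recipe. It assumes `io.StringIO` in place of the Python 2 `StringIO` module, a unicode 'U5' field instead of 'a5', and an explicit `encoding="utf-8"` so that the string field comes back as str; those choices are mine, not from the original session.

```python
from io import StringIO

import numpy as np

s = StringIO("11.3abcde")

# Structured dtype: one (name, format) pair per fixed-width column.
dt = np.dtype([('myint', 'i8'), ('myfloat', 'f8'), ('mystring', 'U5')])

# delimiter as a list of ints gives the column widths: 1, 3 and 5 characters.
data = np.genfromtxt(s, dtype=dt, delimiter=[1, 3, 5], encoding="utf-8")

print(data.dtype.names)
print(data['myint'], data['myfloat'], data['mystring'])
```

With a single input line the result is a zero-dimensional structured array, so fields are read with `data['myint']` rather than by row index.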
>
> Skipper