[Numpy-discussion] Using gentxt to import a csv with a string class label and hundreds of integer features

Fri May 8 08:42:53 EDT 2015

On Thu, May 7, 2015 at 2:26 AM, Dammy <damilarefagbemi at gmail.com> wrote:
> Hi,
> I am trying to use numpy.genfromtxt to import a csv for classification using
> scikit-learn. The first column in the csv is a string class label, while the
> 200+ remaining columns are integer features.
> I would like to find out how I can use the genfromtxt function to specify a
> string dtype for the first column and an int dtype for all other
> columns.
>
> I have tried using "dtype=None" as shown below, but when I print
> dataset.shape, I get (number_of_rows,), i.e. no columns are read in:
>  dataset = np.genfromtxt(file,delimiter=',', skip_header=True)
>
> I also tried setting the dtypes as shown in the examples below, but I get
> the same error as dtype=None:

These dtypes create structured arrays:
http://docs.scipy.org/doc/numpy/user/basics.rec.html

So it is expected that the shape is just the number of rows. The columns are
fields of the dtype and can be accessed by name, like a dictionary:

In [21]: d = np.ones(3, dtype='S2, int8')

In [22]: d
Out[22]:
array([('1', 1), ('1', 1), ('1', 1)],
      dtype=[('f0', 'S2'), ('f1', 'i1')])

In [23]: d.shape
Out[23]: (3,)

In [24]: d.dtype.names
Out[24]: ('f0', 'f1')

In [25]: d[0]
Out[25]: ('1', 1)

In [26]: d['f0']
Out[26]:
array(['1', '1', '1'],
      dtype='|S2')

In [27]: d['f1']
Out[27]: array([1, 1, 1], dtype=int8)
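
For the original use case, a minimal sketch along the same lines might look
like the following. It assumes a small in-memory CSV in place of the real
file, and the field names ('label', 'x0', ...), the label width 'U16' and
n_features are all hypothetical choices, not taken from the original data.
It reads the file into a structured array with genfromtxt, then splits it
into a 1-D label array and a plain 2-D integer array of the kind
scikit-learn estimators usually expect:

```python
import io

import numpy as np

# Hypothetical stand-in for the real CSV: one header row, a string class
# label in the first column, then integer feature columns.
csv_text = (
    "label,a,b,c\n"
    "cat,1,2,3\n"
    "dog,4,5,6\n"
    "cat,7,8,9\n"
)

n_features = 3  # assumption: would be 200+ for the real file

# Build the structured dtype programmatically: one string field for the
# label ('U16' is an assumed maximum label length), then one integer
# field per feature column.
feature_names = ["x%d" % i for i in range(n_features)]
dtype = [("label", "U16")] + [(name, np.int64) for name in feature_names]

data = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", skip_header=1, dtype=dtype
)

# The structured array is 1-D: one record per row.
print(data.shape)

# Pull the label field out by name, and stack the feature fields
# side by side into an ordinary 2-D integer array.
y = data["label"]
X = np.column_stack([data[name] for name in feature_names])
print(X.shape)
```

With the real file, the io.StringIO object would be replaced by the file
name, and n_features by the actual number of feature columns. Labels longer
than the chosen string width would be cut off, so the width should be at
least the length of the longest label.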