Newbie - converting csv files to arrays in NumPy - Matlab vs. Numpy comparison

John Machin sjmachin at lexicon.net
Thu Jan 11 03:11:25 EST 2007


sturlamolden wrote:
> oyekomova wrote:
> > Thanks for your help. I compared the following code in NumPy with the
> > csvread in Matlab for a very large csv file. Matlab read the file in
> > 577 seconds. On the other hand, this code below kept running for over 2
> > hours. Can this program be made more efficient? FYI - The csv file was
> > a simple 6 column file with a header row and more than a million
> > records.
> >
> >
> > import csv
> > from numpy import array
> > import time
> > t1=time.clock()
> > file_to_read = file('somename.csv','r')
> > read_from = csv.reader(file_to_read)
> > read_from.next()
>
> > datalist = [ map(float, row[:]) for row in read_from ]
>
> I'm willing to bet that this is your problem. Python lists are arrays
> under the hood!
>
> Try something like this instead:
>
>
> # read the whole file in one chunk
> lines = file_to_read.readlines()
> # count the number of columns
> n = 1
> for c in lines[1]:
>    if c == ',': n += 1
> # count the number of rows
> m = len(lines[1:])

Please consider using
    m = len(lines) - 1
which gets the row count without building a throwaway copy of the list.

> #allocate
> data = empty((m,n),dtype=float)
> # create csv reader, skip header
> reader = csv.reader(lines[1:])

lines[1:] again? That slice builds yet another full copy of the list,
just to skip the header row.
The OP set you an example:
    read_from.next()
so you could use:
    reader = csv.reader(lines)
    _unused = reader.next()
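
Putting the pieces together, the loop version would look something like
this (a sketch with a tiny in-memory file standing in for somename.csv;
next(reader) and the str.count() column count are my substitutions, which
also work on newer Pythons):

```python
import csv
from io import StringIO

import numpy as np

# Tiny stand-in for 'somename.csv': one header row, two data rows.
text = "a,b,c\n1,2,3\n4,5,6\n"
lines = StringIO(text).readlines()

# Count columns from the first data row; count rows without slicing
# the list (len(lines) - 1 instead of len(lines[1:])).
n = lines[1].count(",") + 1
m = len(lines) - 1

# Preallocate, skip the header the way the OP did, then fill row by row.
data = np.empty((m, n), dtype=float)
reader = csv.reader(lines)
_unused = next(reader)          # discard the header row
for i, row in enumerate(reader):
    data[i, :] = [float(x) for x in row]
```

No list of a million Python float objects is ever built; each parsed row
goes straight into the preallocated array.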

> # read
> for i in arange(0,m):
>    data[i,:] = map(float,reader.next())
>
> And if this is too slow, you may consider vectorizing the last loop:
>
> data = empty((m,n),dtype=float)
> newstr = ",".join(lines[1:])
> flatdata = data.reshape((n*m)) # flatdata is a view of data, not a copy
> reader = csv.reader([newstr])
> flatdata[:] = map(float,reader.next())
> 
> I hope this helps!
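
One caveat on the vectorized version: readlines() keeps the trailing
'\n' on every line, so ",".join(lines[1:]) embeds newlines in the joined
string and csv ends the record at the first one, leaving flatdata short.
Stripping the lines before joining fixes it; a sketch on toy in-memory
data:

```python
import csv
from io import StringIO

import numpy as np

# Tiny stand-in for the real file: one header row, two data rows.
text = "a,b,c\n1,2,3\n4,5,6\n"
lines = StringIO(text).readlines()

n = lines[1].count(",") + 1
m = len(lines) - 1

data = np.empty((m, n), dtype=float)
flatdata = data.reshape(n * m)   # flatdata is a view of data, not a copy

# Strip each line's trailing newline before joining, so the whole file
# becomes one long csv record instead of stopping after the first row.
newstr = ",".join(line.strip() for line in lines[1:])
reader = csv.reader([newstr])
flatdata[:] = [float(x) for x in next(reader)]
```

Writing through the flatdata view fills data in place, so the single
parsed record lands directly in the (m, n) array.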




More information about the Python-list mailing list