[Numpy-discussion] seeking advice on a fast string->array conversion
Darren Dale
dsdale24 at gmail.com
Tue Nov 16 09:41:04 EST 2010
Apologies, I accidentally hit send...
On Tue, Nov 16, 2010 at 9:20 AM, Darren Dale <dsdale24 at gmail.com> wrote:
> I am wrapping up a small package to parse a particular ascii-encoded
> file format generated by a program we use heavily here at the lab. (In
> the unlikely event that you work at a synchrotron, and use Certified
> Scientific's "spec" program, and are actually interested, the code is
> currently available at
> https://github.com/darrendale/praxes/tree/specformat/praxes/io/spec/
> .)
>
> I have been benchmarking the project against another python package
> developed by a colleague, which is an extension module written in pure
> C. My python/cython project takes about twice as long to parse and
> index a file (~0.8 seconds for 100MB), which is acceptable. However,
> actually converting ascii strings to numpy arrays, which is done using
> numpy.fromstring, takes a factor of 10 longer than the extension
> module. So I am wondering about the performance of np.fromstring:
import time
import numpy as np
s = b'1 ' * 2048 *1200
d = time.time()
x = np.fromstring(s, dtype='d', sep=b' ')
print time.time() - d
That takes about 1.3 seconds on my machine. A similar metric for the
extension module is to load 1200 of these 2048-element arrays from the
file:
d=time.time()
x=[s.mca(i+1) for i in xrange(1200)]
print time.time()-d
That takes about 0.127 seconds on my machine. This discrepancy is
unacceptable for my use case, so I need to develop an alternative to
fromstring. Here is a bit of testing with Cython:
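import time
# Declare the C library function so Cython can call it directly.
cdef extern from 'stdlib.h':
    double atof(char*)
py_string = '100'
cdef char* c_string = py_string
# Initialize the counter explicitly rather than relying on a default.
cdef int i = 0
cdef int j = 2048*1200
d = time.time()
while i < j:
    # Re-fetch the C pointer each pass to mimic per-token conversion cost.
    c_string = py_string
    val = atof(c_string)
    i += 1
print val, time.time()-d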
That loop takes 0.33 seconds to execute, which is a good start. I need
some help converting this example to return an actual numpy array.
Could anyone please offer a suggestion?
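
To make the question concrete, here is a minimal sketch of the shape such a function might take. It is written in plain Python rather than Cython, so it shows the structure only and not the speed. The name parse_floats is hypothetical and is not part of numpy or of the package above. The idea is to count the tokens first, preallocate the output with np.empty, and fill it one element at a time. A Cython version would presumably do the same with a typed buffer and atof in the inner loop.

```python
import numpy as np

def parse_floats(text):
    """Convert whitespace-separated ASCII numbers into a 1-D float64 array.

    Hypothetical helper for illustration only.
    """
    tokens = text.split()
    # Preallocate the result so no intermediate list of floats is built.
    out = np.empty(len(tokens), dtype=np.float64)
    for k, tok in enumerate(tokens):
        # float() plays the role that atof would play in a Cython loop.
        out[k] = float(tok)
    return out

# Small input with the same layout as the benchmark string above.
x = parse_floats('1 ' * 2048 * 3)
```

Whether this approach beats np.fromstring once the loop is compiled would have to be measured. The sketch makes no performance claim.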
Thanks,
Darren