[Numpy-discussion] seeking advice on a fast string->array conversion

Tue Nov 16 09:41:04 EST 2010

Apologies, I accidentally hit send...

On Tue, Nov 16, 2010 at 9:20 AM, Darren Dale <dsdale24 at gmail.com> wrote:
> I am wrapping up a small package to parse a particular ascii-encoded
> file format generated by a program we use heavily here at the lab. (In
> the unlikely event that you work at a synchrotron, and use Certified
> Scientific's "spec" program, and are actually interested, the code is
> currently available at
> https://github.com/darrendale/praxes/tree/specformat/praxes/io/spec/
> .)
>
> I have been benchmarking the project against another python package
> developed by a colleague, which is an extension module written in pure
> C. My python/cython project takes about twice as long to parse and
> index a file (~0.8 seconds for 100MB), which is acceptable. However,
> actually converting ASCII strings to numpy arrays, which is done using
> numpy.fromstring, takes a factor of 10 longer than the extension
> module. So I am wondering about the performance of np.fromstring:

import time
import numpy as np

# 2048 * 1200 space-separated values, as ASCII bytes
s = b'1 ' * 2048 * 1200
d = time.time()
x = np.fromstring(s, dtype='d', sep=' ')
print(time.time() - d)

That takes about 1.3 seconds on my machine. A similar metric for the
extension module is to load 1200 of these 2048-element arrays from the
file:

# note: s here is an object from the extension module, not the bytes string above
d = time.time()
x = [s.mca(i + 1) for i in range(1200)]
print(time.time() - d)

That takes about 0.127 seconds on my machine. This discrepancy is
unacceptable for my use case, so I need to develop an alternative to
fromstring. Here is a bit of testing with Cython:

import time

cdef extern from 'stdlib.h':
    double atof(char*)

py_string = b'100'
cdef char* c_string = py_string
cdef int i = 0, j = 2048 * 1200

d = time.time()
while i < j:
    # Python bytes -> char* coercion happens on every iteration
    c_string = py_string
    val = atof(c_string)
    i += 1
print("%s %s" % (val, time.time() - d))

That loop takes 0.33 seconds to execute, which is a good start. I need
some help converting this example to return an actual numpy array.
Could anyone please offer a suggestion?
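
For reference, the behavior being asked for can be sketched in plain
Python: preallocate the output array, then fill it one token at a time.
This is only an illustration of the target, not a fast implementation.
The function name parse_floats is hypothetical, float() stands in for
the C atof() call, and no timing is claimed for it.

```python
import numpy as np


def parse_floats(text):
    """Convert whitespace-separated ASCII numbers to a 1-D float64 array.

    Hypothetical helper: a slow reference for what a Cython version
    would need to return.
    """
    tokens = text.split()  # split on any run of whitespace
    out = np.empty(len(tokens), dtype=np.float64)  # preallocate the result
    for i, token in enumerate(tokens):
        out[i] = float(token)  # stands in for atof() in the Cython loop
    return out


x = parse_floats("1 2.5 -3e2")
print(x.tolist())  # [1.0, 2.5, -300.0]
```

A Cython version would presumably keep the same shape (allocate once,
write each converted value into the array's buffer) while replacing the
per-token Python objects with C-level pointer arithmetic.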

Thanks,
Darren