[Numpy-discussion] seeking advice on a fast string->array conversion

Tue Nov 16 13:44:10 EST 2010

On 11/16/10 10:01 AM, Christopher Barker wrote:

> OK -- I'll whip up a test similar to yours -- stay tuned!

Here's what I've done:

import numpy as np
from maproomlib.utility import file_scanner

def gen_file():
     f = open('test.dat', 'w')  # open() works in both Python 2 and 3
     for i in range(1200):
         f.write('1 ' * 2048)
         f.write('\n')
     f.close()

def read_file1():
     """ read unknown length: doubles"""
     f = open('test.dat')
     arr = file_scanner.FileScan(f)
     f.close()
     return arr

def read_file2():
     """ read known length: doubles"""
     f = open('test.dat')
     arr = file_scanner.FileScanN(f, 1200*2048)
     f.close()
     return arr

def read_file3():
     """ read known length: singles"""
     f = open('test.dat')
     arr = file_scanner.FileScanN_single(f, 1200*2048)
     f.close()
     return arr

def read_fromfile1():
     """ read unknown length with fromfile(): singles"""
     f = open('test.dat')
     arr = np.fromfile(f, dtype=np.float32, sep=' ')
     f.close()
     return arr

def read_fromfile2():
     """ read unknown length with fromfile(): doubles"""
     f = open('test.dat')
     arr = np.fromfile(f, dtype=np.float64, sep=' ')
     f.close()
     return arr

def read_fromstring1():
     """ read unknown length with fromstring(): singles"""
     f = open('test.dat')
     text = f.read()  # renamed from 'str' to avoid shadowing the builtin
     arr = np.fromstring(text, dtype=np.float32, sep=' ')
     f.close()
     return arr

And the results (ipython's timeit):

In [40]: timeit test.read_fromfile1()
1 loops, best of 3: 561 ms per loop

In [41]: timeit test.read_fromfile2()
1 loops, best of 3: 570 ms per loop

In [42]: timeit test.read_file1()
1 loops, best of 3: 336 ms per loop

In [43]: timeit test.read_file2()
1 loops, best of 3: 341 ms per loop

In [44]: timeit test.read_file3()
1 loops, best of 3: 515 ms per loop

In [46]: timeit test.read_fromstring1()
1 loops, best of 3: 301 ms per loop

So my file_scanner is faster than fromfile(), but not radically so 
(about 340 ms versus about 565 ms). However, reading the whole file 
into a string and then using fromstring() is, in fact, the fastest 
method at 301 ms -- interesting -- which shows why you need to profile!
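
For anyone who wants to repeat the fromfile() versus fromstring() comparison without my file_scanner module, here is a minimal sketch. It assumes a current Python 3 and NumPy; the function names and the benchmark() helper are made up for illustration and are not the attached test script. It only covers the two NumPy-only readers, and timings will differ by machine and NumPy version.

```python
import os
import tempfile
import timeit

import numpy as np


def gen_file(path, rows=1200, cols=2048):
    """Write rows x cols space-separated ones, one row per line."""
    with open(path, "w") as f:
        for _ in range(rows):
            f.write("1 " * cols + "\n")


def read_fromfile(path, dtype=np.float64):
    """Let NumPy open and parse the text file itself."""
    return np.fromfile(path, dtype=dtype, sep=" ")


def read_fromstring(path, dtype=np.float64):
    """Read the whole file into a string first, then parse it in one go."""
    with open(path) as f:
        text = f.read()
    return np.fromstring(text, dtype=dtype, sep=" ")


def benchmark(path, repeat=3):
    """Return the best wall time in seconds over `repeat` runs per reader."""
    readers = {"fromfile": read_fromfile, "fromstring": read_fromstring}
    return {
        name: min(timeit.repeat(lambda: func(path), number=1, repeat=repeat))
        for name, func in readers.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.dat")
        gen_file(path)
        for name, seconds in benchmark(path).items():
            print(f"{name}: {seconds * 1000:.0f} ms")
```

Taking the minimum over a few runs mirrors what timeit's "best of 3" reports above.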

Also, with my code, reading singles is slower than doubles (515 ms 
versus 341 ms) -- odd. Perhaps the C lib fscanf reads doubles anyway, 
then converts to singles?

Anyway, for my needs, my file_scanner and fromfile() are fast enough, 
and much faster than parsing the files with Python. My issue with 
fromfile is flexibility and robustness -- it's buggy in the face of 
ill-formed files. See the list archives and the bug reports for more detail.
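
When robustness matters more than speed, one option is a strict pure-Python fallback that refuses any token it cannot convert. The sketch below is only an illustration of that trade-off: parse_strict is a hypothetical helper, not part of NumPy or of file_scanner, and it will be much slower than the readers timed above.

```python
import numpy as np


def parse_strict(text, dtype=np.float64):
    """Parse whitespace-separated numbers, raising on the first bad token."""
    values = []
    for index, token in enumerate(text.split()):
        try:
            values.append(float(token))
        except ValueError:
            # Report which token failed instead of returning partial data.
            raise ValueError(
                f"token {index} is not a number: {token!r}"
            ) from None
    return np.array(values, dtype=dtype)
```

A caller could try the fast path first and fall back to something like this only for files that look suspicious, for example when the element count does not match what was expected.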

Still, it seems your very basic method is indeed a faster way to go.

I've enclosed the files. It's currently built as part of a larger lib, 
so no setup.py -- though it could be written easily enough.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: file_scan_module.c
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20101116/b40e5c38/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_simple_large.py
Type: application/x-python
Size: 1354 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20101116/b40e5c38/attachment.bin>