[SciPy-User] Ignore characters while reading text

Fri Jun 14 11:59:07 EDT 2013

On Fri, Jun 14, 2013 at 7:46 AM, Daπid <davidmenhur at gmail.com> wrote:
> On 14 June 2013 14:32, Matt Newville <newville at cars.uchicago.edu> wrote:
>> Would this do?
>>
>>     import numpy as np
>>     from cStringIO import StringIO
>>     txt= '1 (2 3 4) (5 6 7) (8 9 10)'
>>     np.loadtxt(StringIO(txt.replace('(', '').replace(')', '')))
>
> If I am not mistaken, then you are reading the data twice or thrice.
> If this is big, and performance is critical, you may be better off
> doing the loadtxt yourself. The core of np.loadtxt is essentially a
> "for line in file: data.append(parse(line))", with some wrapping
> intelligence, that probably is not needed in your case.
>
> https://github.com/numpy/numpy/blob/v1.7.0/numpy/lib/npyio.py#L610 (loadtxt)
> https://github.com/numpy/numpy/blob/v1.7.0/numpy/lib/npyio.py#L1573 (genfromtxt)
>
>
> David.

What do you mean by "reading the data twice or thrice"?  In this snippet
the text is held in a string and never read from disk.  For a real file,
the data is read from disk once; after that, the string's replace()
method works on the in-memory copy and is fast, and StringIO only wraps
the string in a file-like object, so I don't see how the data is "read"
multiple times.
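
As a small illustration of that point, here is a minimal sketch in
Python 3 (the code elsewhere in this message is Python 2, so io.StringIO
stands in for cStringIO).  It deletes both paren characters in a single
str.translate() pass instead of two replace() calls, then parses the
in-memory text through a file-like object.  The helper name
parse_ints_ignoring_parens is made up for this example and is not part
of numpy.

```python
import io

# Translation table that deletes '(' and ')' in one pass over the string
# (the third argument of str.maketrans lists characters to drop).
DROP_PARENS = str.maketrans('', '', '()')

def parse_ints_ignoring_parens(txt):
    """Hypothetical helper: parse whitespace-separated ints, ignoring parens."""
    # StringIO gives a file-like view of the cleaned string; nothing touches disk.
    cleaned = io.StringIO(txt.translate(DROP_PARENS))
    # Skip blank lines so a trailing newline does not produce an empty row.
    return [[int(word) for word in line.split()] for line in cleaned if line.strip()]

rows = parse_ints_ignoring_parens('1 (2 3 4) (5 6 7) (8 9 10)\n')
print(rows)  # [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
```

The cleaned string is still a second copy of the text in memory, but the
file itself would only be read once.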

You are right that numpy.loadtxt is somewhat slower than rolling your
own.   For 2 MB files with 20000 lines of 12 columns (all integers),
the test code below gives (total time for 25 runs of each function):

    Array size =  (20000, 12)
    Results are equivalent?  True True True
    Time, numpy.loadtxt, parens not allowed: 13.5254 sec
    Time, numpy.loadtxt, parens allowed:     13.7965 sec
    Time, python list, parens not allowed:   10.1362 sec
    Time, python list, parens allowed:       10.7494 sec

The cost of allowing for parens with txt.replace('(', '') etc. is not
significant (about 2% to 6% here) -- certainly less time than
pre-processing the files in any way.  Using numpy.loadtxt is about 30%
slower than a direct read to a list followed by conversion to an array.
That might make a difference in some cases, but numpy.loadtxt involves
less code and is more robust against unexpected input.
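
The benchmark needs two input files holding the same numbers, one with
embedded parens and one without.  How the original files were produced
is not shown, so the following is only a sketch under assumptions
(Python 3): the function name write_test_files, the seven-digit values
(chosen so that 20000 lines of 12 columns come out near 2 MB), and the
grouping of columns in threes are all made up for illustration.

```python
import os
import random
import tempfile

def write_test_files(plain_path, parens_path, nrows=20000, ncols=12, seed=0):
    """Hypothetical generator: identical integers in both files, one with parens."""
    rng = random.Random(seed)
    with open(plain_path, 'w') as plain, open(parens_path, 'w') as parens:
        for _ in range(nrows):
            # Assumption: seven-digit integers, about 8 bytes per column.
            words = [str(rng.randint(1000000, 9999999)) for _ in range(ncols)]
            plain.write(' '.join(words) + '\n')
            # Assumed layout: columns wrapped in groups of three, "(a b c) (d e f) ...".
            groups = ['(' + ' '.join(words[i:i + 3]) + ')'
                      for i in range(0, ncols, 3)]
            parens.write(' '.join(groups) + '\n')

# Small demonstration in a temporary directory.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, 'plain.dat')
parens_path = os.path.join(tmpdir, 'parens.dat')
write_test_files(plain_path, parens_path, nrows=5)
with open(parens_path) as fh:
    print(fh.readline().rstrip())
```

Stripping '(' and ')' from the second file reproduces the first one
exactly, which is what the equivalence check in the test code relies on.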

import timeit
import numpy as np
from cStringIO import StringIO

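def f2arr_np0(fname):
    # numpy.loadtxt on the raw text: parens not allowed
    with open(fname, 'r') as fh:
        txt = fh.read()
    return np.loadtxt(StringIO(txt))

def f2arr_np1(fname):
    # numpy.loadtxt after stripping '(' and ')': parens allowed
    with open(fname, 'r') as fh:
        txt = fh.read()
    return np.loadtxt(StringIO(txt.replace('(', '').replace(')', '')))

def f2arr_py0(fname):
    # python list of ints, then np.array: parens not allowed
    tmp = []
    with open(fname, 'r') as fh:
        for line in fh.readlines():
            tmp.append([int(word) for word in line.split()])
    return np.array(tmp)

def f2arr_py1(fname):
    # python list of ints after stripping '(' and ')': parens allowed
    tmp = []
    with open(fname, 'r') as fh:
        for line in fh.readlines():
            tmp.append([int(word) for word in
                        line.replace('(', '').replace(')', '').split()])
    return np.array(tmp)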

# the file test1.dat has embedded parens, test0.dat does not
p0 = f2arr_py0('test0.dat')
p1 = f2arr_py1('test1.dat')
n0 = f2arr_np0('test0.dat')
n1 = f2arr_np1('test1.dat')

print 'Array size = ', n1.shape
print 'Results are equivalent? ', np.all(p0 == p1), np.all(p0 == n0), \
      np.all(p0 == n1)

rnp0 = timeit.timeit("f2arr_np0('test0.dat')",
                     setup='from __main__ import f2arr_np0', number=25)
rnp1 = timeit.timeit("f2arr_np1('test1.dat')",
                     setup='from __main__ import f2arr_np1', number=25)

rpy0 = timeit.timeit("f2arr_py0('test0.dat')",
                     setup='from __main__ import f2arr_py0', number=25)
rpy1 = timeit.timeit("f2arr_py1('test1.dat')",
                     setup='from __main__ import f2arr_py1', number=25)

print 'Time, numpy.loadtxt, parens not allowed: %.4f sec' % rnp0
print 'Time, numpy.loadtxt, parens allowed:     %.4f sec' % rnp1
print 'Time, python list, parens not allowed:   %.4f sec' % rpy0
print 'Time, python list, parens allowed:       %.4f sec' % rpy1

--Matt