[Numpy-discussion] load from text files Pull Request Review

Christopher Jordan-Squire cjordan1 at uw.edu
Mon Sep 12 19:38:13 EDT 2011


I did some timings to see what the advantage would be, in the simplest
case possible, of taking multiple lines from the file to process at a
time, assuming the dtype is already known. The code is attached. What
I found was that I can't use generators to avoid constructing a list
and then making a tuple from the list. It appears that the user must
create a tuple to place a row in a numpy record array. (Specifically,
if you remove the tuple() call from f2 in the attached code, you get
an error.) Taking multiple lines at a time (using f4) does provide a
speed benefit, but it's not very big, since Python's re module won't
let you capture more than 100 values and I'm using capturing to
extract the values. (This is done because we're allowing the user to
use regular expressions to denote delimiters.)
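
For concreteness, here is a minimal sketch (not the attached
txt_time.py; the dtype and values are made up) of the tuple
requirement described above. The list assignment is wrapped in
try/except because whether it is rejected depends on the NumPy
version; on the NumPy used in this thread it raised an error, hence
the tuple() call in f2:

import numpy as np

dt = np.dtype([('x', np.float64), ('y', np.int32)])
arr = np.zeros(2, dtype=dt)

# Assigning a tuple into a record array row works.
arr[0] = (2.54, 4)
print(arr[0])

# Assigning a plain list may be rejected, depending on the NumPy version.
try:
    arr[1] = [2.54, 4]
except (TypeError, ValueError) as exc:
    print("list rejected:", exc)
else:
    print("list accepted on this NumPy version:", arr[1])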

In the example the data is just a bunch of space-delimited integers.
f1 splits on the spaces and uses a list comprehension, f2 splits on
the spaces and uses a generator, f3 uses regular expressions in a
manner similar to the current code, f4 uses regular expressions on
multiple lines at once, and f5 uses fromiter. (Though fromiter isn't
as useful as I'd hoped, because you have to have already parsed out a
line, since this is a record array.) f6 and f7 use stripped-down
versions of Chris Barker's accumulator idea. The difference is that f6
uses resize when expanding the array while f7 uses np.empty followed
by np.append, which avoids the penalty from copying data that
np.resize imposes. Note that f6 and f7 use the regular expression
capturing line by line, as in f3. To get a feel for the overhead
involved with keeping track of string sizes, f8 is just f3 except that
it also keeps a list of the largest string sizes seen so far.
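
Since the attachment was scrubbed from the archive, here is a rough,
hedged sketch of what the first few variants look like as described
above. The dtype, the regular expression, and the sample lines are
invented for illustration, and the attached code certainly differs in
detail:

import re
import numpy as np

# Assumed example: three space-delimited integer columns.
dt = np.dtype([('a', np.int32), ('b', np.int32), ('c', np.int32)])

def f1(lines, dtype=dt):
    # split on whitespace, convert each line with a list comprehension
    out = np.empty(len(lines), dtype=dtype)
    for i, line in enumerate(lines):
        out[i] = tuple([int(x) for x in line.split()])
    return out

def f2(lines, dtype=dt):
    # same, but with a generator expression; the tuple() call is still
    # needed to assign the row into the record array
    out = np.empty(len(lines), dtype=dtype)
    for i, line in enumerate(lines):
        out[i] = tuple(int(x) for x in line.split())
    return out

def f3(lines, dtype=dt):
    # regex capturing line by line, loosely in the spirit of the current code
    pat = re.compile(r'(\d+)\s+(\d+)\s+(\d+)')
    out = np.empty(len(lines), dtype=dtype)
    for i, line in enumerate(lines):
        out[i] = tuple(int(g) for g in pat.match(line).groups())
    return out

lines = ['1 2 3', '4 5 6', '7 8 9']
print(f1(lines))
print(f2(lines))
print(f3(lines))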

The speeds I get using timeit are:
f1 : 1.66ms
f2 : 2.01ms
f3 : 2.35ms
f4(2) : 3.02ms (Odd that it starts out worse than f3 when you take 2
lines at a time)
f4(5) : 2.25ms
f4(10) : 2.02ms
f4(15) : 1.93ms
f4(20) : error
f5 : 2.28ms (As I said, fromiter can't do much when it's just filling
in a record array. While it's slightly faster than f3, which it's
based on, it also loads all the data as a list before creating a numpy
array, which is rather suboptimal.)
f6 : 3.26ms
f7 : 2.77ms (Apparently it's a lot cheaper to do np.empty followed by
append than to do a resize)
f8 : 3.04ms (Compared to f3, this shows there's a non-trivial
performance hit from keeping track of the sizes)
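
The f6/f7 difference can be reproduced in isolation with something
along these lines (a self-contained sketch, not the attached code; the
chunk sizes are arbitrary and the relative numbers will vary by
machine and NumPy version):

import timeit
import numpy as np

chunk = np.arange(100)
n_chunks = 200

def grow_with_resize():
    # f6-style growth: enlarge with np.resize, then fill in the new tail
    out = chunk.copy()
    filled = len(chunk)
    for _ in range(n_chunks - 1):
        out = np.resize(out, filled + len(chunk))
        out[filled:] = chunk
        filled += len(chunk)
    return out

def grow_with_append():
    # f7-style growth: start from np.empty and np.append each chunk
    out = np.empty(0, dtype=chunk.dtype)
    for _ in range(n_chunks):
        out = np.append(out, chunk)
    return out

print("resize:", timeit.timeit(grow_with_resize, number=100))
print("append:", timeit.timeit(grow_with_append, number=100))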

It seems like taking multiple lines at once isn't a big gain when
we're limited to 100 captured entries at a time (for Python 2.6, at
least), especially since taking multiple lines at once would be rather
complex: the code must still check each line to see whether it's
commented out or not.
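
The 100-group limitation can be checked directly. On the Python
versions in question (the 2.6 era), compiling a pattern with more than
100 capturing groups fails; newer interpreters have since lifted the
limit, so the check below is hedged with a try/except:

import re
import sys

# 101 capturing groups of digits separated by whitespace
pattern = r'\s+'.join([r'(\d+)'] * 101)

try:
    re.compile(pattern)
    print(sys.version.split()[0],
          "compiled a pattern with more than 100 groups")
except (AssertionError, re.error) as exc:
    # Older interpreters (such as the 2.6 mentioned above) refuse this
    print("group limit hit:", exc)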

After talking to Chris Farrow, an Enthought developer, and doing some
timing on a dataset he was working on, I had loadtable running about
1.7 to 3.3 times as fast as genfromtxt. The catch is that genfromtxt
was loading datetimes as strings, while loadtable was loading them as
numpy datetimes. The conversion from string to datetime is somewhat
expensive, so I think that accounts for some of the extra time. The
spread in the ratio--roughly 1.5 to 3.5--reflects how many lines are
used to check for sizes and dtypes. As it turns out, checking for
those can be quite expensive, and the majority of the time seems to be
spent in the regular expression matching. (Though Chris is using a
slight variant of my pull request, and I'm getting function times that
are not as bad as his.) The cost of the size and type checking was
less apparent in the example I timed in a previous email, because in
that case there was a huge cost both for converting data that used
commas instead of decimal points and for the datetime conversion.
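
To make the string-to-datetime cost concrete, here is a minimal,
self-contained comparison (not Chris's dataset; the dates and sizes
are made up) between leaving a date column as strings and parsing it
into numpy datetimes:

import timeit
import numpy as np

date_strings = ['2011-09-%02d' % (d % 28 + 1) for d in range(10000)]

as_str = timeit.timeit(lambda: np.array(date_strings), number=20)
as_dt = timeit.timeit(lambda: np.array(date_strings, dtype='datetime64[D]'),
                      number=20)
print("load as strings:      %.4f s" % as_str)
print("parse to datetime64:  %.4f s" % as_dt)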

To give some further context, I compared np.genfromtxt and
np.loadtable on the same 'pseudo-file' f used in the above tests, when
the data is just a bunch of integers. The results were:

np.genfromtxt with dtype=None: 4.45 ms
np.loadtable with defaults: 5.12 ms
np.loadtable with check_sizes=False: 3.7 ms

So it seems that np.loadtable is already competitive with
np.genfromtxt apart from the size checking, and even the size checking
isn't that huge a penalty compared to genfromtxt.
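
For anyone who wants to reproduce the genfromtxt half of that
comparison, a sketch along these lines should do it. The pseudo-file
here is invented and its size is arbitrary; np.loadtable is the
function from the pull request, so it isn't available in a released
numpy and can't be timed without that branch:

import io
import timeit
import numpy as np

# An invented 'pseudo-file' of whitespace-separated integers.
rows = '\n'.join(' '.join(str((i * j) % 100) for j in range(5))
                 for i in range(1000))

def run_genfromtxt():
    # genfromtxt consumes the stream, so rebuild it for every call
    return np.genfromtxt(io.StringIO(rows), dtype=None)

t = timeit.timeit(run_genfromtxt, number=10) / 10
print("np.genfromtxt with dtype=None: %.2f ms per call" % (t * 1e3))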

Based on all of the above, the accumulator seems like the most
promising way to speed things up. But it's not completely clear to me
by how much, since we still must keep track of the dtypes and the
sizes.
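
For reference, here is a stripped-down sketch of the accumulator idea
(hypothetical; not Chris Barker's code or the code in the pull
request) that pre-allocates with np.empty and grows by np.append. The
dtype is fixed up front; tracking dtypes and string sizes on the fly
is exactly the part left out here:

import numpy as np

class Accumulator(object):
    # Collect rows into a pre-allocated record array, doubling its size
    # with np.append when it fills up.
    def __init__(self, dtype, initial=128):
        self._data = np.empty(initial, dtype=dtype)
        self._n = 0

    def append(self, row):
        if self._n == len(self._data):
            extra = np.empty(len(self._data), dtype=self._data.dtype)
            self._data = np.append(self._data, extra)
        self._data[self._n] = row
        self._n += 1

    def finish(self):
        return self._data[:self._n].copy()

acc = Accumulator(np.dtype([('x', np.float64), ('y', np.int32)]))
for i in range(1000):
    acc.append((0.5 * i, i))
print(acc.finish()[:3])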

Other than possibly changing loadtable to use np.NA instead of masked
arrays in the presence of missing data, I'm starting to feel like it's
more or less complete for now, and can be left to be improved in the
future. Most of the things that have been discussed are either
performance trade-offs or somewhat large re-engineering of the
internals.

-Chris JS


On Thu, Sep 8, 2011 at 3:57 PM, Chris.Barker <Chris.Barker at noaa.gov> wrote:
> On 9/8/11 1:43 PM, Christopher Jordan-Squire wrote:
>> I just ran a quick test on my machine of this idea. With
>>
>> dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
>> temp = np.empty((), dtype=dt)
>> temp2 = np.zeros(1,dtype=dt)
>>
>> In [96]: def f():
>>      ...:     l=[0]*3
>>      ...:     l[0] = 2.54
>>      ...:     l[1] = 4
>>      ...:     l[2] = 2.3645
>>      ...:     j = tuple(l)
>>      ...:     temp2[0] = j
>>
>> vs
>>
>>
>> In [97]: def g():
>>      ...:     temp['x'] = 2.54
>>      ...:     temp['y'] = 4
>>      ...:     temp['z'] = 2.3645
>>      ...:     temp2[0] = temp
>>      ...:
>>
>> The timing results were 2.73 us for f and 3.43 us for g. So good idea,
>> but it doesn't appear to be faster. (Though the difference wasn't
>> nearly as dramatic as I thought it would be, based on Pauli's
>> comment.)
>
> My guess is that lines like temp['x'] = 2.54 are slower (they require
> a dict lookup, and a conversion from a Python type to a "raw" type)
>
> and
>
> temp2[0] = temp
>
> is faster, as that doesn't require any conversion.
>
> Which means that if you had a larger struct dtype, it would be even
> slower, so clearly not the way to go for performance.
>
> It would be nice to have a higher-performing struct dtype scalar -- as
> it is ordered, it might be nice to be able to index it with either the
> name or a numeric index.
>
> -Chris
>
>
>
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: txt_time.py
Type: text/x-python
Size: 4472 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110912/0a6a9874/attachment.py>

