[Numpy-discussion] fromstring() is slow, no really!

Sun May 13 19:34:32 EDT 2012

And I forgot to attach the relevant code (though it is also in my fork)...

On Sun, May 13, 2012 at 6:28 PM, Anthony Scopatz <scopatz at gmail.com> wrote:

> Hello All,
>
> This week, while doing some optimization, I found that np.fromstring()
> is significantly slower than many alternatives out there.  This function
> basically does two things: (1) it splits the string and (2) it converts the
> data to the desired type.
>
> There isn't much we can do about the conversion/casting, so what I
> mean is that the *string splitting implementation is slow*.
>
> To simplify the discussion, I will just talk about converting strings to
> 1d float64 arrays.  I have also issued pull request #279 [1] to numpy with
> some sample code.  Timings can be seen in the ipython notebook here.
>
> It turns out that using str.split() and np.array() is 20 - 35% faster,
> which was non-intuitive to me.  That is to say:
>
> rawdata = s.split()
> data = np.array(rawdata, dtype=float)
>
>
> is faster than
>
> data = np.fromstring(s, sep=" ", dtype=float)
>
>
> The next thing to try, naturally, was Cython.  This did not change the
> timings much for these two strategies.  However, being in Cython
> allows us to call atof() directly.  My implementation is based on a
> previous thread on this topic [2].  However, in the example in [2], the
> string was hard coded, contained only one data value, and did not need to
> be split.  Thus they saw a dramatic 10x speed boost.  To deal with the
> more realistic case, I first just continued to use str.split().  This
> took 35 - 50% less time than np.fromstring().
>
> Finally, using the strtok() function in the C standard library to call
> atof() while we tokenize the string reduces the run time further, to
> 50 - 60% less than the baseline np.fromstring() time.
>
> Timings
> ------------
> In [1]: import fromstr
>
> In [2]: s = "100.0 " * 100000
>
> In [3]: timeit fromstr.fromstring(s)
> 10 loops, best of 3: 20.7 ms per loop
>
> In [4]: timeit fromstr.split_and_array(s)
> 100 loops, best of 3: 16.1 ms per loop
>
> In [6]: timeit fromstr.split_atof(s)
> 100 loops, best of 3: 13.5 ms per loop
>
> In [7]: timeit fromstr.token_atof(s)
> 100 loops, best of 3: 8.35 ms per loop
>
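For anyone without IPython at hand, the "N loops, best of 3" figures can be
approximated with the standard timeit module.  This is only a minimal sketch:
best_of() and split_and_float() are hypothetical names, and split_and_float()
is a pure-Python stand-in rather than one of the fromstr functions, so the
absolute numbers will not match the ones quoted above.

```python
import timeit


def best_of(func, arg, number=10, repeat=3):
    """Return the best per-call time in milliseconds over `repeat` runs."""
    # timeit.repeat returns one total (in seconds) per run of `number` calls.
    totals = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(totals) / number * 1e3


def split_and_float(s):
    # Stand-in workload (an assumption): split, then convert each token.
    return [float(tok) for tok in s.split()]


s = "100.0 " * 100000
ms = best_of(split_and_float, s)
print("%d loops, best of %d: %.3g ms per loop" % (10, 3, ms))
```

Taking the minimum rather than the mean follows the usual timeit advice: the
fastest run is the one least disturbed by other activity on the machine.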
> Possible Explanation
> ----------------------------------
> Numpy's fromstring() function may be found here [3].  This code is a bit
> hard to follow, but it uses the array_from_text() function [4].  On the
> other hand, str.split() [5] uses a macro function SPLIT_ADD().  The
> difference between these, I believe, is that str.split() over-allocates
> the size of the list in a more aggressive way than array_from_text().
> This leads to fewer resizes and thus fewer memory copies.
>
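To make the resize argument concrete, here is a small sketch that counts how
many reallocations two growth policies need while appending items one at a
time.  Both policies (a fixed 4096-item chunk and simple doubling) are
assumptions chosen for illustration; they are not the actual growth rules of
array_from_text() or of CPython's list.

```python
def count_resizes(n_items, grow):
    """Count reallocations needed to append n_items one at a time.

    `grow` maps the current capacity to the next, larger capacity.
    """
    capacity = 0
    resizes = 0
    for size in range(1, n_items + 1):
        if size > capacity:
            capacity = grow(capacity)
            resizes += 1
    return resizes


def fixed_chunk(capacity):
    # Hypothetical policy: grow by a constant block of 4096 items.
    return capacity + 4096


def geometric(capacity):
    # Hypothetical policy: start at 8 items and double each time.
    return max(8, capacity * 2)


n = 100000
print(count_resizes(n, fixed_chunk))  # 25
print(count_resizes(n, geometric))    # 15
```

The gap widens with input size: constant-chunk growth needs a number of
resizes proportional to n, while geometric growth needs only about log2(n).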
> This would also explain why the tokenize implementation is the fastest,
> since it pre-allocates the maximum possible array size and then slices it
> down.  No resizes are present in this function, though it requires more
> memory up front.
>
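The pre-allocate-then-slice idea can be sketched in pure Python as follows.
This is not the attached fromstr.pyx: token_floats() is a hypothetical name,
the manual scan stands in for strtok(), and float() stands in for atof().  It
only shows the shape of the approach and will be far slower than the Cython
version.

```python
from array import array


def token_floats(s):
    """Parse whitespace-separated floats in one pass, with no resizes."""
    # Each token needs at least one character plus a separator, so
    # len(s) // 2 + 1 is an upper bound on the number of tokens.
    out = array("d", [0.0]) * (len(s) // 2 + 1)
    count = 0
    i, n = 0, len(s)
    while i < n:
        # Skip any run of separators.
        while i < n and s[i].isspace():
            i += 1
        start = i
        # Advance to the end of the current token.
        while i < n and not s[i].isspace():
            i += 1
        if i > start:
            out[count] = float(s[start:i])
            count += 1
    # Slice the over-sized buffer down to what was actually filled.
    return out[:count]


print(token_floats("100.0 2.5 3e2 ").tolist())  # [100.0, 2.5, 300.0]
```

The trade-off is the one described above: the buffer is sized for the worst
case (single-character tokens), so peak memory is higher than strictly needed.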
> Summary (tl;dr)
> ------------------------
> np.fromstring() is slow because of the mechanism it uses to split strings.
> This is likely due to how many resize operations it must perform.  While
> it need not be the *fastest* thing out there, it should probably be at
> least as fast as Python string splitting.
>
> No pull-request 'fixing' this issue was provided because I wanted to see
> what people thought and if / which option is worth pursuing.
>
> Be Well
> Anthony
>
> [1] https://github.com/numpy/numpy/pull/279
> [2] http://comments.gmane.org/gmane.comp.python.numeric.general/41504
> [3] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3699
> [4] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3418
> [5] http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fromstr.pyx
Type: application/octet-stream
Size: 1530 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120513/3871d304/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: setup.py
Type: application/octet-stream
Size: 251 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120513/3871d304/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fromstr.ipynb
Type: application/octet-stream
Size: 3436 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120513/3871d304/attachment-0002.obj>