[Numpy-discussion] searchsorted() and memory cache

Sat May 10 12:55:53 EDT 2008

On Sat, May 10, 2008 at 9:31 AM, Stéfan van der Walt <stefan at sun.ac.za>
wrote:

> Hi Andrew
>
> 2008/5/9 Andrew Straw <strawman at astraw.com>:
> > I've got a big element array (25 million int64s) that searchsorted()
> > takes a long time to grind through. After a bit of digging in the
> > literature and the numpy source code, I believe that searchsorted() is
> > implementing a classic binary search, which is pretty bad in terms of
> > cache misses. There are several modern implementations of binary search
> > which arrange items in memory such that cache misses are much more rare.
> > Clearly making such an indexing arrangement would take time, but in my
> > particular case, I can spare the time to create an index if searching
> > was faster, since I'd make the index once but do the searching many
> times.
> >
> > Is there an implementation of such an algorithm that works easilty with
> > numpy? Also, can you offer any advice, suggestions, and comments to me
> > if I attempted to implement such an algorithm?
>
> If found Francesc Altet's Pyrex implementation at
>
> http://mail.python.org/pipermail/python-list/2007-November/466503.html
>
> I modified it for use with Cython and added some tests:
>
> https://code.launchpad.net/~stefanv/+junk/my_bisect<https://code.launchpad.net/%7Estefanv/+junk/my_bisect>
>
> That may be a good starting point for further experimentation.  As it
> is, it is already about 10 times faster than the built-in version.

The built in version is in c, but not type specific. It could be moved to
the generated code base easily enough. The slow part is here

            if (compare(parr + elsize*imid, pkey, key) < 0)

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20080510/210d3580/attachment.html>