[Numpy-discussion] Radix sort?

Tue Jun 19 16:29:05 EDT 2007

On 6/19/07, Jon Wright <wright at esrf.fr> wrote:
>
> Dear numpy experts,
>
> I see from the docs that there seem to be 3 sorting algorithms for array
> data (quicksort, mergesort and heapsort). After hearing a rumour about
> radix sorts and floats I google'd and now I'm wondering about a radix
> sort for numpy (and Numeric) scalars? See:
>
> http://www.stereopsis.com/radix.html
> http://en.wikipedia.org/wiki/Radix_sort
>
> The algorithm is apparently a trick where no comparisons are used. A
> shockingly bad benchmark of a swig wrapped test function below suggests
> it really is quicker than the array.sort() numpy method for uint8. At
> least with >256 element uint8 test arrays (numpy1.0.3, mingw32, winXP),
> for me, today; of course ymmv.
>
> With large enough arrays of all of the integer numeric types and also
> ieee reals, appropriate versions of the radix sort might be able to:
>     "... kick major ass in the speed department."
> [http://www.cs.berkeley.edu/~jrs/61b/lec/36]
>
> Have these sorts already been implemented in scipy? Can someone share
> some expertise in this area? There is a problem about the sorting not
> being done in-place (eg: better for argsort than sort). I see there is a
> lexsort, but it is not 100% obvious to me how to use that to sort
> scalars, especially ieee floats. If numpy is a library for handling big
> chunks of numbers with knowable binary representations, I guess there
> might be some general interest in having this radix sort compiled for
> the platforms where it works?
>
> Thanks for any opinions,
>
> Jon
> ---

Straight radix sort might be an interesting option for some things.
However, its performance can depend on whether the input data is random or
not and it takes up more space than merge sort. Other potential drawbacks
arise from the bit twiddling needed for signed numbers and floats, the
former solved by converting to offset binary numbers (complement the sign
bit), and the latter in the way your links indicate, but both leading to a
proliferation of special cases. Maintaining byte order and byte addressing
portability between cpu architectures might also require masking and
shifting that will add computational expense and may lead to more special
cases for extended precision floats and so on. That said, I would be curious
to see how it works out if you want to give it a try.

==setup.py==
> from distutils.core import setup, Extension
> e = Extension("_sort_radix_uint8",sources=["radix_wrap.c"])
> setup(ext_modules=[e])
>
> ==radix.i==
> %module sort_radix_uint8
> %{
> #define SWIG_FILE_WITH_INIT
> void sort_radix_uint8(unsigned char * ar, int len){
>      int hits[256];
>      int i, j;
>      for(i=0;i<256;i++) hits[i]=0;
>      for(i=0;i<len;i++) hits[(int)ar[i]]+=1;
>      i=0; j=0;
>      while(i<256){
>          while(hits[i]>0){
>              /* shortcut for uint8 */
>              ar[j++] = (unsigned char) i;
>              hits[i]--;
>          }
>          i++;
>      }
> }
> %}
> %init %{
> import_array();
> %}
> %include "numpy.i"
> %apply (unsigned char* INPLACE_ARRAY1, int DIM1) { (unsigned char * ar,
> int len)}
> void sort_radix_uint8(unsigned char * ar, int len);

You can't use this method for longer keys, you will need to copy over the
whole number. Sedgewick's Algorithms in C++ has a discussion and
implementation of the radix sort. And, of course, Knuth also.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070619/4099c2d2/attachment.html>