[Numpy-discussion] String sort

Sat Feb 9 15:53:25 EST 2008

On Feb 9, 2008 11:50 AM, Francesc Altet <faltet at carabos.com> wrote:

> On Saturday 09 February 2008, Charles R Harris wrote:
> > > Well, for the unicode case, wouldn't it be enough to replace
> > > 'char' with 'Py_ArrayUCS4'?  Maybe this afternoon I can do some
> > > benchmarking too in this regard.
> >
> > Looks like that for Numpy. The problem I was thinking about is that
> > for wide characters Windows C defaults to UTF16 while the Unixes
> > default to UTF32.
>
> If only it were so simple ;-)  The fact is that the Python crew delivers
> the tarballs ready to compile with UCS2 as the default, and this
> applies to both UNIX and Windows.  However, some Linux distributions
> (most notably Debian and its derivatives) have chosen to make UCS4
> the default in their Python packages.
>
> This is not a (big) problem in itself, but when it comes to writing
> arrays to disk and hoping for portability (not only across different
> platforms, but also across Python interpreters built with different UCS
> widths on the same machine!), we realized that this was a real problem
> (see the discussion in [1]).  So, NumPy had to make a decision in that
> regard, and Travis finally opted to support only the UCS4 charset in
> NumPy [2].  He also left the door open to possible UCS2 implementations
> in NumPy in the future, but that would be a real pain, IMHO.
>
>
> [1] http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006081.html
>
> [2] http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006130.html
>
> So, at least for the time being, you only have to worry about UCS4.
>
> > The C99 standard didn't specify the exact length,
> > but Numpy seems to use (or assume) UTF32.
>
> Well, I should say that UTF32 and UCS4 are names referring to the same
> thing, but most literature (and especially package configuration
> procedures) talks about UCS4.
>
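
To make the width issue concrete, here is a minimal sketch in plain Python. It is only an illustration: the "utf-16-le" and "utf-32-le" codecs stand in for what a UCS2 or UCS4 build would store per character (an assumption, since this does not inspect NumPy or the interpreter build), and unit_sizes is a hypothetical helper name.

```python
# Assumption: the UTF-16 and UTF-32 codecs approximate the per-character
# storage of "UCS2" and "UCS4" builds; NumPy itself is not involved here.
samples = ["a", "\u00e9", "\U0001d11e"]  # ASCII, Latin-1 range, outside the BMP


def unit_sizes(ch):
    """Return (bytes in UTF-16, bytes in UTF-32) for a single code point."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-32-le"))


for ch in samples:
    utf16, utf32 = unit_sizes(ch)
    print(f"U+{ord(ch):06X}: UTF-16 {utf16} bytes, UTF-32 {utf32} bytes")
```

Every code point takes 4 bytes in UTF-32, while UTF-16 uses 2 bytes inside the BMP and 4 bytes (a surrogate pair) outside it, so raw buffers written with one width cannot be read back unchanged with the other.
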
> > Anyway, after doing some work to fool the optimizer and subtracting
> > loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs
> > 16e-9 seconds to compare strings of length 10. I've attached the
> > program. Note that on my machine malloc appears to return zeroed
> > memory, so the string compares always go to the end.
>
> I've seen the benchmark, and the problem is that C strncmp stops
> checking when it finds a \0 in the first string, while strncmp1 has to
> check the complete set of chars in the strings.  However, you won't really
> want to do C string comparisons with NumPy strings:
>
> In [35]: ns1 = numpy.array("as\0as")
>
> In [36]: ns2 = numpy.array("as\0bs")
>
> In [37]: ns1 == ns2
> Out[37]: array(False, dtype=bool)
>
> In [38]: ns1 < ns2
> Out[38]: array(True, dtype=bool)
>
> or, with Python strings, in general:
>
> In [39]: ns1 = "as\0as"
>
> In [40]: ns2 = "as\0bs"
>
> In [41]: ns1 == ns2
> Out[41]: False
>
> In [42]: ns1 < ns2
> Out[42]: True
>
> As you see, Python/NumPy strings are different beasts from C strings in
> that regard.  C strings always end with a \0 (NUL)
> character, while in Python/NumPy the end is defined by a length
> property (btw, the same as in Pascal, if you know it).
>
> So, strncmp1 is not only faster than its C counterpart, but also the one
> doing the correct job with NumPy (unicode) strings.
>

Ah, in that case the current indirect sort for NumPy strings, which uses
strncmp, is incorrect and needs to be fixed. It seems that strings with
zeros are not part of the current test series ;)
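
To illustrate the failure, here is a minimal sketch in pure Python. It is not the NumPy C code: c_strncmp and fixed_width_cmp are hypothetical helpers that only emulate the two comparison strategies, and WIDTH is an assumed fixed item size.

```python
from functools import cmp_to_key


def c_strncmp(a, b, n):
    """Emulate C strncmp: compare at most n bytes, stopping at the first NUL."""
    for i in range(n):
        ca = a[i] if i < len(a) else 0  # bytes past the end act as NUL padding
        cb = b[i] if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
        if ca == 0:  # both strings hit NUL at the same position: "equal"
            return 0
    return 0


def fixed_width_cmp(a, b, n):
    """Compare all n bytes, treating NUL as ordinary data."""
    pa = a.ljust(n, b"\0")[:n]
    pb = b.ljust(n, b"\0")[:n]
    return (pa > pb) - (pa < pb)


WIDTH = 5  # assumed item size of the string array
words = [b"as\0bs", b"as\0as"]

# strncmp never looks past the embedded NUL, so the two items look equal
# and a stable sort leaves them in their original (wrong) order.
by_strncmp = sorted(words, key=cmp_to_key(lambda a, b: c_strncmp(a, b, WIDTH)))

# Comparing the full width orders them correctly.
by_width = sorted(words, key=cmp_to_key(lambda a, b: fixed_width_cmp(a, b, WIDTH)))

print(by_strncmp)
print(by_width)
```

Only the full-width comparison distinguishes the two items, which is the behaviour the Python and NumPy sessions quoted above expect.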

Chuck