[Python-Dev] Comparison speed

Tim Peters tim.one@home.com
Wed, 16 May 2001 04:41:47 -0400


[Martin]
> Producing numbers is easy :-)

If only making sense of them were too <0.6 wink>.

> I've instrumented my version where string implements richcmp, and
> special-cases everything I can think of.

1. String objects can also be equal despite being distinct objects,
   if their ob_sinterned pointers are equal and non-NULL.  So if
   you're looking for every trick in & out of the book, that's
   another one (a sketch of these fast paths follows this list).

2. But the real goal is to add only those special cases that in
   combination yield the largest net win, and that's much harder
   to determine (since there are no typical apps, and it's very
   hard to quantify the tradeoffs here in a credible x-platform
   x-app way).
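
To make those concrete, here's roughly the shape such fast paths
could take at the top of a string richcmp slot.  A sketch only, not
Martin's patch; the function name is invented, though the fields it
touches (ob_sinterned, ob_size) are the real 2.1 ones:

    #include "Python.h"

    /* Sketch of fast paths at the top of a string richcmp slot.
       Not Martin's patch; illustrative only. */
    static PyObject *
    string_richcompare_sketch(PyObject *v, PyObject *w, int op)
    {
        PyStringObject *a, *b;
        PyObject *result;

        if (!PyString_Check(v) || !PyString_Check(w)) {
            /* The "different types" case from the counts above. */
            Py_INCREF(Py_NotImplemented);
            return Py_NotImplemented;
        }
        a = (PyStringObject *)v;
        b = (PyStringObject *)w;

        /* Identical objects, or distinct objects interned to the
           same string, are equal; no bytes need to be read. */
        if (a == b ||
            (a->ob_sinterned != NULL &&
             a->ob_sinterned == b->ob_sinterned)) {
            switch (op) {
            case Py_EQ: case Py_LE: case Py_GE:
                result = Py_True;
                break;
            default:                /* Py_NE, Py_LT, Py_GT */
                result = Py_False;
                break;
            }
            Py_INCREF(result);
            return result;
        }

        /* Length trick: for Py_EQ/Py_NE, unequal lengths decide. */
        if ((op == Py_EQ || op == Py_NE) && a->ob_size != b->ob_size) {
            result = (op == Py_NE) ? Py_True : Py_False;
            Py_INCREF(result);
            return result;
        }

        /* Remaining calls would do the byte-by-byte oldcomp-style
           work here; elided in this sketch (placeholder return). */
        Py_INCREF(Py_NotImplemented);
        return Py_NotImplemented;
    }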

> Counting was done while running the test suite. With this, I get
>
> Calls to string_richcompare:   2378660
> Calls with different types:      33992 (i.e., one is not a string)
> Calls with identical strings:   120517
> Calls where lens decide !EQ:   1775716
> ----------------------------
> Calls richcmp -> oldcomp:       448435
> Total calls to oldcomp:        1225643
> Calls oldcomp -> memcmp:        860174
>
> So 5% of the calls are with identical strings, for which I can
> immediately decide the outcome.

But that also comes at the cost of a fruitless compare and branch in the
other 95% of calls.  There isn't enough data to tell whether this is a net
win or a net loss (compared to leaving this special case out).

Note that if the "identical string pointers" special case is a net win, it
would be just as effective inside oldcomp (i.e., you don't need a
richcompare slot to exploit it); indeed, it may be more effective there,
since some 800,000 calls to oldcomp *didn't* come from richcmp, and oldcomp
doesn't check for pointer equality now (but PyObject_Compare does, so there
didn't *used* to be any point to doing it in oldcomp).
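
For concreteness, here's roughly what that placement would look like:
2.1's string_compare() in Objects/stringobject.c with the pointer test
pasted on top.  Only the first test is new; the rest just mirrors the
existing logic (a sketch, not a patch):

    #include "Python.h"
    #include <string.h>

    static int
    string_compare_sketch(PyStringObject *a, PyStringObject *b)
    {
        int len_a, len_b, min_len, cmp;

        if (a == b)
            return 0;   /* the new test: identical object => equal */

        len_a = a->ob_size;
        len_b = b->ob_size;
        min_len = (len_a < len_b) ? len_a : len_b;
        if (min_len > 0) {
            /* The first byte often decides, cheaply. */
            cmp = Py_CHARMASK(*a->ob_sval) - Py_CHARMASK(*b->ob_sval);
            if (cmp == 0)
                cmp = memcmp(a->ob_sval, b->ob_sval, min_len);
            if (cmp != 0)
                return cmp;
        }
        return (len_a < len_b) ? -1 : (len_a > len_b) ? 1 : 0;
    }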

Any idea where those 800,000 virgin calls to oldcomp are coming from?  That's
a lot.

> 75% can be decided in terms of the string lengths, which leaves ca. 19%
> for cases where lexicographical comparison is needed.

So about 1 in 5 times there's also the additional overhead (wrt just
calling oldcomp all the time) of a second function call (i.e., the call to
oldcomp made by richcmp).

> In those cases, the first byte decides in 30%. If I remove the test
> for "len decides !EQ", I get
>
> #riches:                       2358322
> #riches_ni:                      34108
> #idents_decide:                 102050
> #lens_decide:                        0
> --------------------------------------
> rest(computed):                2222164
> #comps:                        2949421
> #memcmps:                       917776
>
> So still, ca. 30% can be decided by first byte.

Sorry, I couldn't follow this part, except to note that 917776 is about 30%
of 2949421, in which case I would have expected you to say that 70% can be
decided by the first byte.

> It still appears that the total number of calls to memcmp is higher
> when the length is not taken into consideration.

Since 917776 is larger than the earlier 860174, isn't that plain?  BTW, some
compilers inline memcmp, so assuming it's "a call" is a x-platform trap; of
course assuming it *isn't* is also a x-platform trap.

> To verify this claim, I've counted the cases where the length
> decides the outcome, but where looking at the first byte would
> also have decided it:
>
> lens_decide:                    1784897
> lens_decide_firstbyte_wouldhave:1671148
>
> So in 6% of the cases, checking the length alone gives a decision
> which looking at the first byte doesn't; plus it saves a function
> call.

OTOH, 19% of all richcmp calls ended up calling oldcomp too, so the *net*
effect is muddy at best.

> To support the thesis that Py_EQ is the common case for strings, I
> counted the various operations:
>
> pyEQ:2271593
> pyLE:9234
> pyGE:0
> pyNE:20470
> pyLT:22765
> pyGT:578

This clearly wasn't doing much sorting of strings (or of tuples containing
strings, etc) -- .sort() never uses pyEQ (it only uses pyLT).

> Now, that might be flawed since comparing strings for equal is
> extremely frequent in the testsuite. To give more credibility to the
> data, I also ran setup.py with my instrumented ./python:

In the absence of non-trivial use of sorting, the bisect module, or one of
the search tree modules out there, it's easy to buy that Py_EQ is most
common for strings.  What's not clear is that adding a rich comparison slot
actually helps overall (as compared to continuing to let string_compare()
handle it, and, if the pointer-equality test actually saves more than it
costs, adding it there instead).  It's clearer that this is going to hurt
sorting (& bisect etc), by adding yet another layer of function call to get
Py_LT resolved.  As for dict compares too, the string richcmp can't do
anything to speed up Py_LT that string oldcomp can't do just as
efficiently -- indeed, that's the great advantage of oldcomp's "compare
first character" test:  that *can* decide Py_LT in one byte much of the
time, but a length comparison cannot.
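
A trivial standalone illustration of that last point (plain C; no
Python internals assumed):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "ab" < "b" lexicographically: the first byte ('a' vs 'b')
           decides Py_LT at once, yet "ab" is the *longer* string, so
           a length test points the wrong way for ordering. */
        printf("%d\n", strcmp("ab", "b") < 0);                  /* 1 */
        printf("%d %d\n", (int)strlen("ab"), (int)strlen("b")); /* 2 1 */
        return 0;
    }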

Note too the earlier mail about how adding a richcmp slot to strings will
suddenly slow cmp(string1, string2), which is the usual way to program a
search tree:  cmp() *used* to call a string comparison routine only once,
but after adding a richcmp slot, each cmp(string1, string2) will call the
richcmp slot from 1 thru 3 times (data-dependent).
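
For reference, that fallback machinery has roughly this shape (an
abbreviated sketch in the spirit of try_rich_to_3way_compare in
Objects/object.c; the real code calls the slot directly, where this
sketch goes through PyObject_RichCompareBool for brevity):

    #include "Python.h"

    static int
    rich_to_3way_sketch(PyObject *v, PyObject *w)
    {
        static const struct { int op; int outcome; } tries[3] = {
            {Py_EQ,  0},    /* probe 1 */
            {Py_LT, -1},    /* probe 2 */
            {Py_GT,  1},    /* probe 3 */
        };
        int i, r;

        for (i = 0; i < 3; i++) {
            /* Each probe is a full richcmp slot call, hence each
               cmp(string1, string2) costs 1 thru 3 slot calls. */
            r = PyObject_RichCompareBool(v, w, tries[i].op);
            if (r < 0)
                return -2;  /* error during comparison */
            if (r > 0)
                return tries[i].outcome;
        }
        return 2;           /* no ordering established; caller falls back */
    }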

> ...
> That shows that optimizing for Py_NE is not worth it. With these data,
> I'll upload a patch to SF.

Which is here:

http://sourceforge.net/tracker/index.php?func=detail&aid=424335&group_id=5470&atid=305470

Heh:  let's grab all the ugly URLs off of SourceForge, stick them in a giant
list, and sort them.  Can't think of a more typical app than that <wink>.

Thanks for the work, Martin!