[Python-ideas] INSANE FLOAT PERFORMANCE!!!

Elliot Gorokhovsky elliot.gorokhovsky at gmail.com
Tue Oct 11 20:25:16 EDT 2016


To answer your question: I special-case unicode (strings), ints, and
floats. I am working on special-casing tuples as well (the elements can even
be of different types; I just need column-wise homogeneity). The best
speedups will be for tuples of floats: they bypass three layers of useless
checks.
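
Roughly, the pre-sort check I have in mind looks like the sketch below (not
the actual patch; list_is_all_float is a made-up name):

#include <Python.h>

/* Sketch only: scan the list once before sorting, and use the specialized
 * compare only if every element is an exact float.  The same per-column
 * idea applies to tuples. */
static int
list_is_all_float(PyObject **items, Py_ssize_t n)
{
    for (Py_ssize_t i = 0; i < n; i++) {
        /* CheckExact rejects float subclasses, which could override __lt__ */
        if (!PyFloat_CheckExact(items[i]))
            return 0;    /* heterogeneous: fall back to the generic compare */
    }
    return 1;            /* homogeneous floats: safe to special-case */
}

If the scan fails, sorting proceeds exactly as before, so the check costs at
most one extra pass over the list.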

If I run it without special-casing floats (just dispatching through
tp_richcompare) I only get about a 5-10% speedup. I'm working on rigorous
benchmarks for all of this and will post a PDF along with the patch once
it's all done, but it's certainly <10%. Part of that is because my float
special case is really low-level; for strings I've actually found the
opposite: going through tp_richcompare gives me almost the same results as
my special-case compare, since the special case still has to
PyUnicode_READY the strings.
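
For concreteness, the tp_richcompare variant I mean is roughly the following
(a sketch; islt_via_slot is a made-up name, and NotImplemented/error handling
is simplified):

#include <Python.h>

/* Skip PyObject_RichCompare's dispatch, but still go through the type's own
 * slot, so PyFloat_RichCompare (or unicode's richcompare, PyUnicode_READY
 * and all) still runs in full.  Assumes a pre-sort scan has already proved
 * that v and w share the same exact type. */
static int
islt_via_slot(PyObject *v, PyObject *w)
{
    PyObject *res = Py_TYPE(v)->tp_richcompare(v, w, Py_LT);
    if (res == NULL)
        return -1;                   /* the comparison raised an exception */
    int lt = PyObject_IsTrue(res);   /* usually Py_True or Py_False */
    Py_DECREF(res);
    return lt;
}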

Regarding generalization: the general technique for special-casing is to
substitute every type check with the constant 1 or 0 implied by the type
assumption you're making. That's the only way to guarantee the result is
safe and compliant.
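
As a concrete picture of what that substitution buys (again just a sketch,
not real CPython code; the name is made up): once the assumption "every
element is an exact float" is established, the generic dispatch reduces to a
single direct call.

#include <Python.h>

/* The subclass-reordering check and the "does this type implement
 * tp_richcompare?" check both have known answers under the assumption,
 * so only the call itself survives. */
static PyObject *
dispatch_assuming_float(PyObject *v, PyObject *w, int op)
{
    /* was: if (Py_TYPE(w) != Py_TYPE(v) && PyType_IsSubtype(...))  -> 0 */
    /* was: if (Py_TYPE(v)->tp_richcompare == NULL)                 -> 0 */
    return PyFloat_Type.tp_richcompare(v, w, op);
}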

Elliot

On Tue, Oct 11, 2016, 5:19 PM Jim J. Jewett <jimjjewett at gmail.com> wrote:

> Excellent.
> I'm surprised cache didn't save more, but less surprised than I was ... I
> hadn't realized that you were skipping the verifications in
> PyFloat_RichCompare as well.  Does that generalize to other element types
> without exposing too much of the per-type internals to list.sort?
>
> Oh ... and I appreciate your not quoting private email as a general
> courtesy, but I hereby give you permission if it was mine that was private.
> [Though I think your summary was better than a quote anyhow.]
>
> -jJ
>
> On Oct 11, 2016 4:58 PM, "Elliot Gorokhovsky" <
> elliot.gorokhovsky at gmail.com> wrote:
>
> So I got excited here. And the reason why is that I got those numbers *on
> Tim's benchmark*. When I got these kinds of numbers on my own benchmarks, I
> figured there was probably a problem with the way I was timing, and that
> the gains certainly couldn't be as extreme as they suggested. But this is
> on a benchmark that's already in the codebase!
>
>
> Here is a detailed explanation of how to reproduce my results, and of the
> circumstances under which they would be invalid:
>
> ******************************************
>
> To reproduce, just activate a virtualenv, and then clone
> https://github.com/embg/python-fast-listsort.git. Then python setup.py
> install and python sortperf.py.
>
>
> Now let's look at what sortperf.py does and how it relates to Tim's
> benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made
> three changes:
>
>
> 1. I added an import, "import fastlist". This obviously would not make
> sorting twice as fast.
>
>
> 2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" +
> " %7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again
> irrelevant.
>
>
> 3. I changed the timing function
>
> #from this
>
>
> def doit(L):
>     t0 = time.perf_counter()
>     L.sort()
>     t1 = time.perf_counter()
>     print("%6.2f" % (t1-t0), end=' ')
>     flush()
>
>
>
> #to this
>
>
> def doit(L):
>     F = FastList(L)
>     f0 = time.perf_counter()
>     F.fastsort()
>     f1 = time.perf_counter()
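>     # rebuild from the unsorted L: the first sort left the previous F sorted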
>     F = FastList(L)
>     t0 = time.perf_counter()
>     F.sort()
>     t1 = time.perf_counter()
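>     # report how much time fastsort saved relative to sort, as a percentage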
>     print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
>     flush()
>
>
> *******************************************
>
> So what we've shown is that (1) if you trust the existing sorting
> benchmark and (2) if my modification to doit() doesn't mess anything up (I
> leave this up to you to judge), then the measurements are valid. Which is
> a pretty big deal (50%!), hence my overexcitement.
>
> ****************************************
>
>
> Now I'd like to respond to the responses I've gotten (the one I'm thinking
> of was off-list, so I won't quote it) questioning how such a small
> optimization, bypassing the type checks, could possibly have such a large
> effect, even in theory. Here's my answer:
>
> Let's ignore branch prediction and cache for now and just look at it at a
> high level. The cost of sorting scales with the cost of a single
> comparison, because the vast majority of the time (certainly at least 90%,
> depending on the list) is spent in comparisons. So let's look at the cost
> of a comparison.
>
> Without my optimization, comparisons for floats (that's what this
> benchmark looks at) go roughly like this:
>
> 1. Test type of left and right for PyObject_RichCompare (which costs two
> pointer dereferences) and compare them. "3 ops" (quotes because counting
> ops like this is pretty hand-wavy). "2 memory accesses".
>
> 2. Get the address of the float compare function from the type's
> tp_richcompare slot (i.e. PyFloat_Type.tp_richcompare). "1 op". "1 memory
> access".
>
> 3. Call the function whose address we just got. "1 op". "Basically 0
> memory accesses because we count the stack stuff in that 1 op".
>
> 4. Test type of left and right again in PyFloat_RichCompare and compare
> both of them to PyFloat_Type. "4 ops". "2 memory accesses".
>
> 5. Get floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever.
> "2 ops". "2 memory accesses".
>
> 6. Compare the floats and return. "2 ops".
>
> Now let's tally the "cost" (sorry for use of quotes here, just trying to
> emphasize this is an intuitive, theoretical explanation for the numbers
> which doesn't take into account the hardware):
> "13 ops, 7 memory accesses".
>
> Here's what it looks like in my code:
>
> 1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory accesses".
>
> 2. Compare the floats and return. "2 ops".
>
> Tally: "4 ops, 2 memory accesses".
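>
> In code, that fast path is essentially the sketch below (not the actual
> patch; the name is made up, and the real version has to fit the sort's
> internal compare interface):
>
> #include <Python.h>
>
> /* Special-cased float "less than": both type checks are gone, because the
>  * pre-sort scan already proved v and w are exact floats. */
> static int
> unsafe_float_lt(PyObject *v, PyObject *w)
> {
>     return PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w);  /* 2 loads, 1 compare */
> }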
>
> Now you can argue that branch prediction alleviates a lot of this cost,
> since we're taking the same branches every time. But note that, branch
> prediction or not, we still have to do all of those memory accesses, and
> since they're pointers to places all over memory, they miss the cache
> basically every time (correct me if I'm wrong). So memory-wise we really
> are looking at something like a 7:2 ratio, and op-wise perhaps not quite
> as bad because of branch prediction, but still, 13:4 is probably bad no
> matter what's going on in the hardware.
>
> Now consider that something like 90% of our time is spent in those steps.
> Are my numbers really that unbelievable?
>
> Thanks for everything. I'm looking forward to writing this up as a nice
> LaTeX doc with graphs and perf benchmarks and all the other rigorous
> goodies, as well as a special-case compare function for homogeneous tuples
> and a simple patch file,
>
> Elliot
>
>