Balanced trees

Daniel Stutzbach stutzbach at google.com
Mon Mar 17 21:01:24 EDT 2014


On Mon, Mar 17, 2014 at 5:08 PM, Joshua Landau <joshua at landau.ws> wrote:

> Thanks.  First, I want to state that there are two aspects to my
>  claim. The first is that these benchmarks do not represent typical
> use-cases. I will not go too far into this, though, because it's
> mostly obvious.
>

I would love to include macro-benchmarks.  I keep waiting for the PyPy
benchmark suite to get ported to Python 3...


> "Create from an iterator" gives me relatively different results when I
> run it (Python 3).
>

The graphs were originally created to compare vanilla Python with a Python
modified to use blist as the built-in list type.  I think I used Python
3.1, but I'm not certain.  As I recall, the built-in type has a few small
advantages over any third-party extension type, so that might be what
you're seeing.  Alternately, something may have changed between Python
versions.


> "Delete a slice" is fudged from its inclusion of multiplication, which
> is far faster on blists. I admit that it's not obvious how to fix
> this.
>

I could move the initialization into the timed part, similar to what I did
for sort (see below).  That has downsides too, of course, but it might be
an improvement.
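A sketch of what that might look like (not the harness's actual code; the sizes here are just placeholders): folding the list construction into the timed statement, so each run deletes from a fresh list instead of one rebuilt by multiplication.

```python
# Hypothetical sketch: build the list inside the timed statement so
# every run deletes a slice from a fresh, unmodified list.  The
# construction cost is now included in the measurement, which is the
# downside mentioned above.
import timeit

t = timeit.timeit(
    stmt="x = list(range(10**4)); del x[1000:2000]",
    number=100,
)
```

One could time the bare construction separately and subtract it, at the cost of some extra noise.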


> "First in, first out (FIFO)" should be "x.append(0); x.pop(0)".
>

Wow, I mangled that one badly.
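For the record, the corrected pattern would be something like this (list size is an arbitrary choice for illustration):

```python
# FIFO: append at the tail, pop from the head.  pop(0) is O(n) on a
# built-in list, which is exactly what the comparison against blist
# is meant to expose.
import timeit

t = timeit.timeit(
    "x.append(0); x.pop(0)",
    setup="x = list(range(1000))",
    number=10000,
)
```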


> "Last in, first out (LIFO)" should use "pop()" over "pop(-1)",
> although I admit it shouldn't make a meaningful difference.
>

I like pop(-1) because it's explicit rather than implicit.  I agree it
shouldn't make a meaningful difference.


> "Sort *" are really unfair because they put initialisation in the
> timed part


That's a limitation of timeit.  The setup step is only executed once.  If I
put the initialization there, every sort after the first one would be
sorting a pre-sorted list.  If you compare the "Create from an iterator"
and "Sort a random list", you'll see that the initialization cost is
dwarfed by the sorting cost for n > 15 or so.
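To illustrate the limitation (a sketch, not the benchmark's actual code): if the list were built in setup and sorted in place, only the first run would sort random data. Sorting a fresh copy each time sidesteps that, and the copy cost can be measured on its own and subtracted.

```python
# timeit's setup runs once.  sorted(lst) leaves lst untouched, so
# every run sorts the same random input; timing list(lst) separately
# isolates the copy overhead.
import timeit

setup = "import random; lst = [random.random() for _ in range(10**4)]"
sort_time = timeit.timeit("sorted(lst)", setup=setup, number=10)
copy_time = timeit.timeit("list(lst)", setup=setup, number=10)
```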


> and all have keys.


If you use classes with __lt__ methods instead of keys, the cost is
dominated by the calls to __lt__.  You're right that I should include both,
though.


> >>> python -m timeit -s "from random import choice; import blist; lst =
> blist.blist(range(10**0))" "choice(lst)"
> 1000000 loops, best of 3: 1.18 usec per loop
>
> >>> python -m timeit -s "from random import choice; import blist; lst =
> blist.blist(range(10**8))" "choice(lst)"
> 1000000 loops, best of 3: 1.56 usec per loop
>
> Lower size ranges are hidden by the function-call overhead.
> Perhaps this effect is to do with caching, in which case the limits of
> the cache should be explained more readily.
>

That's definitely a cache issue, which is always a risk with
micro-benchmarks.  I see growth even for the built-in list:

gnusto:~$ python -m timeit -s "from random import choice; lst =
list(range(10**0))" "choice(lst)"
1000000 loops, best of 3: 0.349 usec per loop
gnusto:~$ python -m timeit -s "from random import choice; lst =
list(range(10**8))" "choice(lst)"
1000000 loops, best of 3: 0.634 usec per loop

I agree it's more interesting to pick items randomly instead of always
querying the same index.  The overhead of choice() is kind of a problem,
though.  Since I'm only plotting up to 10**5, I'd expect these to look more
or less flat.
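One hypothetical way around the choice() overhead: precompute a pool of random indices in setup and cycle through them, so the timed statement is just a plain subscript (pool size and list size below are arbitrary):

```python
# Precomputed random indices keep the access pattern random while the
# timed statement stays a bare lst[i]; itertools.cycle avoids running
# out of indices during long runs.
import timeit

setup = (
    "import random, itertools\n"
    "lst = list(range(10**5))\n"
    "idx = itertools.cycle([random.randrange(len(lst)) for _ in range(1024)])"
)
t = timeit.timeit("lst[next(idx)]", setup=setup, number=100000)
```

The cycle does reuse the same 1024 indices, so a large pool is needed to avoid reintroducing cache-friendliness.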

Thanks for all of the feedback.  I filed a bug with myself to improve the
metrics:
https://github.com/DanielStutzbach/blist/issues/64

-- 
Daniel Stutzbach