[Python-Dev] Tunning binary insertion sort algorithm in Timsort.

Tue Mar 10 04:46:12 CET 2015

Thank you for your comment. I admit that I did not thinking about the proof
before.
Here is my naive proof. I hope you can give me some comments:

=============
# This proof is an empirical thinking and is not completed, it just gives
us a closer look.
I hope someone can make it more mathematically.

In the proof, we assume we are dealing with unique values array (none of
them are equal together).
Because if they are equal, the "lucky search" can happen and it is
obviously not fair.

Statement_1: With an array of size N or less than N, we need at most
log2(N) comparisons to find a value
(or a position, incase the search miss), using the binary search algorithm.

proof: This statement is trivia, and I believe, someone outthere already
proved it. We can check again
by counting manually.

    let assume we have array of 32 items:
    32 => 16 => 8 => 4 => 2 => 1  (5 comparison)

    how about 24 items (24 < 32):
    24 => 12 => 6 => 3 => 2 => 1  (5 comparison)

ok, good enough. Let's just believe on it to move on.

Statement_2: If we divide an array into two parts, the more unbalanced
arrays we divide, the more
benefit we get from the binary search algorithm.

proof: Let's assume we have an array of 256 items.

    case1:
    If we divide in middle: 128 - 128
    Now, if we search on the left, it costs log2(128) = 7
    If we search on the right, it cost los2(128) = 7

    case2:
    If we divide unbalanced: 32 - 224
    Now, if we search on the left, it costs log2(32) = 5
    If we search on the right, it cost at max 8 comparisons (based on the
statement_1).
    You may not believe me, so let's count it by hand:
    224 => 112 => 56 => 28 => 14 => 7 => 4 => 2 => 1
    So, if we search on the left, we win 2 comparisons compare to case1.
    We search on the right, we lose 1 comparison compare to case1
    I call this is a "benefit".

    case3:
    What if we divide more unbalanced: 8 - 248
    Search on the left: log2(8) = 3 comparisons.
    Search on the right, it costs at max 8 comparisons.
    So if we search on the left, we win 4 comparisons.
    We search on the right, we lose 1 comparisons.
    It is "more benefit", isnt it?

Statement3: Because we are using random array. There is a 50-50 chance that
 next_X will be bigger or smaller than X.

Statement4: We call N is the size of the sorted list, "index" is the
position of X in the sorted list.
Because the array is random, index has an equal chance to exist in any
position in the sorted list.

Statement5: Now we build a model based on previous statements:

    My idea costs 1 comparison (between X and next_X) to devide the array
into two unbalanced parts.
    The old idea costs 1 comparison to divide the array into two balanced
parts.
    Now let's see which one can find position for next_X faster:

    If index belongs to [N/4 to 3N/4]: we may lose 1 comparison, or we may
not lose.
    If index belongs to [N/8 to N/4] or [3N/4 to 7N/8]: We may lose 1
comparison, or we win 1 comparison.
    If index belongs to [N/16 to N/8] or [7N/8 to 15N/16]: We may lose 1
comparison, or we win 2 comparison.
    If index belongs to [N/32 to N/16] or [15N/16 to 31N/32]: We may lose 1
comparison, or we win 3 comparison.
    If index belongs to [N/64 to N/32] or [31N/32 to 64N/64]: We may lose 1
comparison, or we win 4 comparison.
    ...
    and so on.

Statement6: Now we apply the model to a real example.

    Assume that we already has a sorted list with 16 items. And we already
know about "index" of X.
    We can think of it as a gamble game with 16 slots. In every slot, we
only can bid 1 dollar (statement4).

    From slot 5th to slot 12th, we may lose 1, or we may not lose, 50-50
chance.
    So after a huge play times, probability told us that we will lose (8 x
1)/2 = 4 dollars.

    For slot 3, slot 4, slot 13, slot 14, We may lose 1, or we win 1. So
after a huge play times,
    We wont lose or win anything.

    For slot 2, slot 15. We may lose 1, or we win 2. So after a huge play
times, we can win
    (2-1)x2 = 2 dollars.

    For slot 1, slot 16. We may lose 1, or we win 3. So after a huge play
times, we can win 4 dollars.

    In total, after a huge play times, we win 4 + 2 + 0 -4 = 2 dollars !!!!!

    You can test with sorted list 32 or 64 items or any number you want, I
believe the benefit is even more.

Conclusion:
    The unbalanced model give us more benefit than the balanced model. That
means with an array big enough,
    My idea give more benefit than the old idea.

    I think the lucky ticket companies is already know about this. It is a
shame that I do not know
    mathematic principle about this problem.

If I have something more, I will update my proof at:
https://github.com/nhapq/Optimize_binary_insertion_sort/blob/master/proof.txt

======
Thank you.
Nha Pham.

On Mon, Mar 9, 2015 at 10:39 AM, Isaac Schwabacher <ischwabacher at wisc.edu>
wrote:

> On 15-03-08, nha pham
>  wrote:
> >
> > We can optimize the TimSort algorithm by optimizing its binary insertion
> sort.
> >
> > The current version of binary insertion sort use this idea:
> >
> > Use binary search to find a final position in sorted list for a new
> element X. Then insert X to that location.
> >
> > I suggest another idea:
> >
> > Use binary search to find a final postion in sorted list for a new
> element X. Before insert X to that location, compare X with its next
> element.
> >
> > For the next element, we already know if it is lower or bigger than X,
> so we can reduce the search area to the left side or on the right side of X
> in the sorted list.
>
> I don't understand how this is an improvement, since with binary search
> the idea is that each comparison cuts the remaining list to search in half;
> i.e., each comparison yields one bit of information. Here, you're spending
> a comparison to cut the list to search at the element you just inserted,
> which is probably not right in the middle. If you miss the middle, you're
> getting on average less than a full bit of information from your
> comparison, so you're not reducing the remaining search space by as much as
> you would be if you just compared to the element in the middle of the list.
>
> > I have applied my idea on java.util. ComparableTimSort.sort() and
> testing. The execute time is reduced by 2%-6% with array of random integer.
>
> For all that, though, experiment trumps theory...
>
> > Here is detail about algorithm and testing:
> https://github.com/nhapq/Optimize_binary_insertion_sort
> >
> > Sincerely.
> >
> > phqnha
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20150309/b8036606/attachment.html>