[Tutor] sorting algorithm

Thu Mar 11 20:47:26 CET 2010

C.T. Matsumoto wrote:
> Dave Angel wrote:
>> (You forgot to do a Reply-All, so your message went to just me, rather 
>> than to me and the list )
>>
>>
>> C.T. Matsumoto wrote:
>>> Dave Angel wrote:
>>>> C.T. Matsumoto wrote:
>>>>> Hello,
>>>>>
>>>>> This is follow up on a question I had about algorithms. In the 
>>>>> thread it was suggested I make my own sorting algorithm.
>>>>>
>>>>> Here are my results.
>>>>>
>>>>> #!/usr/bin/python
>>>>>
>>>>> def sort_(list_):
>>>>>    for item1 in list_:
>>>>>        pos1 = list_.index(item1)
>>>>>        pos2 = pos1 + 1
>>>>>        try:
>>>>>            item2 = list_[pos2]
>>>>>        except IndexError:
>>>>>            pass
>>>>>
>>>>>        if item1 >= item2:
>>>>>            try:
>>>>>                list_.pop(pos2)
>>>>>                list_.insert(pos1, item2)
>>>>>                return True
>>>>>            except IndexError:
>>>>>                pass
>>>>>
>>>>> def mysorter(list_):
>>>>>    while sort_(list_) is True:
>>>>>        sort_(list_)
>>>>>
>>>>> I found this to be a great exercise. In doing the exercise, I got 
>>>>> pretty stuck. I consulted another programmer (my dad) who described 
>>>>> how to go about sorting. As it turned out the description he 
>>>>> described was the Bubble sort algorithm. Since coding the solution 
>>>>> I know the Bubble sort is inefficient because of repeated 
>>>>> iterations over the entire list. This shed light on the quick sort 
>>>>> algorithm which I'd like to have a go at.
>>>>>
>>>>> Something I haven't tried is sticking in really large lists. I was 
>>>>> told that with really large list you break down the input list into 
>>>>> smaller lists. Sort each list, then go back and use the same 
>>>>> swapping procedure for each of the different lists. My question is, 
>>>>> at what point to you start breaking things up? Is that based on 
>>>>> list elements or is it based on memory(?) resources python is using?
>>>>>
>>>>> One thing I'm not pleased about is the while loop and I'd like to 
>>>>> replace it with a for loop.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> T
>>>>>
>>>>>
>>>> There are lots of references on the web about Quicksort, including a 
>>>> video at:
>>>>     http://www.youtube.com/watch?v=y_G9BkAm6B8
>>>>
>>>> which I think illustrates it pretty well.  It would be a great 
>>>> learning exercise to implement Python code directly from that 
>>>> description, without using the sample C++ code available.
>>>>
>>>> (Incidentally, there are lots of variants of Quicksort, so I'm not 
>>>> going to quibble about whether this is the "right" one to be called 
>>>> that.)
>>>>
>>>> I don't know what your earlier thread was, since you don't mention 
>>>> the subject line, but there are a number of possible reasons you 
>>>> might not have wanted to use the built-in sort.  The best one is for 
>>>> educational purposes.  I've done my own sort for various reasons in 
>>>> the past, even though I had a library function, since the library 
>>>> function had some limits.  One time I recall, the situation was that 
>>>> the library sort was limited to 64k of total data, and I had to work 
>>>> with much larger arrays (this was in 16bit C++, in "large" model).  
>>>> I solved the size problem by using the  C++ sort library on 16k 
>>>> subsets (because a pointer was 2*2 bytes).  Then I merged the 
>>>> results of the sorts.  At the time, and in the circumstances 
>>>> involved, there were seldom more than a dozen or so sublists to 
>>>> merge, so this approach worked well enough.
>>>>
>>>> Generally, it's better for both your development time and the 
>>>> efficiency and reliabilty of the end code, to base a new sort 
>>>> mechanism on the existing one.  In my case above, I was replacing 
>>>> what amounted to an insertion sort, and achieved a 50* improvement 
>>>> for a real customer.  It was fast enough that other factors 
>>>> completely dominated his running time.
>>>>
>>>> But for learning purposes?  Great plan.  So now I'll respond to your 
>>>> other questions, and comment on your present algorithm.
>>>>
>>>> It would be useful to understand about algorithmic complexity, the 
>>>> so called Order Function.  In a bubble sort, if you double the size 
>>>> of the array, you quadruple the number of comparisons and swaps.  
>>>> It's order N-squared or O(n*n).   So what works well for an array of 
>>>> size 10 might take a very long time for an array of size 10000 (like 
>>>> a million times as long).  You can do much better by sorting smaller 
>>>> lists, and then combining them together.  Such an algorithm can  be 
>>>> O(n*log(n)).
>>>>
>>>>
>>>> You ask at what point you consider sublists?  In a language like C, 
>>>> the answer is when the list is size 3 or more.  For anything larger 
>>>> than 2, you divide into sublists, and work on them.
>>>>
>>>> Now, if I may comment on your code.  You're modifying a list while 
>>>> you're iterating through it in a for loop.  In the most general 
>>>> case, that's undefined.  I think it's safe in this case, but I would 
>>>> avoid it anyway, by just using xrange(len(list_)-1) to iterate 
>>>> through it.  You use the index function to find something you would 
>>>> already know -- the index function is slow.  And the first 
>>>> try/except isn't needed if you use a -1 in the xrange argument, as I 
>>>> do above.
>>>>
>>>> You use pop() and push() to exchange two adjacent items in the 
>>>> list.  Both operations copy the remainder of the list, so they're 
>>>> rather slow.  Since you're exchanging two items in the list, you can 
>>>> simply do that:
>>>>     list[pos1], list[pos2] = list[pos2], list[pos1]
>>>>
>>>> That also eliminates the need for the second try/except.
>>>>
>>>> You mention being bothered by the while loop.  You could replace it 
>>>> with a simple for loop with xrange(len(list_)), since you know that 
>>>> N passes will always be enough.  But if the list is partially 
>>>> sorted, your present scheme will end sooner.  And if it's fully 
>>>> sorted, it'll only take one pass over the data.
>>>>
>>>> There are many refinements you could do.  For example, you don't 
>>>> have to stop the inner loop after the first swap.  You could finish 
>>>> the buffer, swapping any other pairs that are out of order.  You'd 
>>>> then be saving a flag indicating if you did any swaps.  You could 
>>>> keep a index pointing to the last pair you swapped on the previous 
>>>> pass, and use that for a limit next time.  Then you just terminate 
>>>> the outer loop when that limit value is 1.  You could even keep two 
>>>> limit values, and bubble back and forth between them, as they 
>>>> gradually close into the median of the list.  You quit when they 
>>>> collide in the middle.
>>>>
>>>> The resultant function should be much faster for medium-sized lists, 
>>>> but it still will slow down quadratically as the list size 
>>>> increases.  You still need to divide and conquer, and quicksort is 
>>>> just one way of doing that.
>>>>
>>>> DaveA
>>>>
>>> Thanks a lot Dave,
>>>
>>> Sorry the original thread is called 'Python and algorithms'.
>>>
>>> Yes, I think it's best to use what python provides and build on top 
>>> of that. I got to asking my original question based on trying to 
>>> learn more about algorithms in general, through python. Of late many 
>>> people have been asking me how well I can 'build' algorithms, and 
>>> this prompted me to start the thread. This is for learning purposes 
>>> (which the original thread will give you and indication where I'm 
>>> coming from).
>>>
>>> The refactored code looks like this. I have tackled a couple items. 
>>> First the sub-listing (which I'll wait till I can get the full sort 
>>> working), then the last couple of paragraphs about refinements. 
>>> Starting with the first refinement, I'm not sure how *not* to stop 
>>> the inner loop?
>>>
>>> def s2(list_):
>>>    for pos1 in xrange(len(list_)-1):
>>>        item1 = list_[pos1]
>>>        pos2 = pos1 + 1
>>>        item2 = list_[pos2]
>>>        if item1 >= item2:
>>>            list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>>>            return True
>>>
>>> def mysorter(list_):
>>>    # This is the outer loop?
>>>    while s2(list_) is True:
>>>        # Calling s2 kicks off the inner loop?
>>>        s2(list_)
>>>
>>> if __name__ == '__main__':
>>>    from random import shuffle
>>>    foo = range(10)
>>>    shuffle(foo)
>>>    mysorter(foo)
>>>
>>>
>>> Thanks again.
>>>
>> As before, I'm not actually trying this code, so there may be typos.  
>> But assuming your code here works, the next refinement would be:
>>
>> In s2() function, add a flag variable, initially False.  Then instead 
>> of the return True, just say flag=True
>> Then at the end of the function, return flag
>>
>> About the while loop.  No need to say 'is True'  just use while 
>> s2(list_):  And no need to call s2() a second time.
>>
>> while s2(list_):
>>     pass
> Okay up to here I follow. This all makes sense.
> 
> def s2(list_):
>     flag = False
>     for pos1 in xrange(len(list_)-1):
>         item1 = list_[pos1]
>         pos2 = pos1 + 1
>         item2 = list_[pos2]
>         if item1 >= item2:
>             list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>             flag = True       
>             return flag
> 
> def mysorter(list_):
>     while s2(list_):
>         pass
>>
>> Before you can refine the upper limit, you need a way to preserve it 
>> between calls.  Simplest way to do that is to combine the two 
>> functions, as a nested loop.  Then, instead of flag, you can have a 
>> value "limit" which indicates what index was last swapped.  And the 
>> inner loop uses that as an upper limit on its xrange.
> Where I start to get confused is refining the 'upper limit'. What is the 
> upper limit defining? I'm guessing it is the last position processed.
> 
> T
>>
>> DaveA
>>
>> _______________________________________________
>> Tutor maillist  -  Tutor at python.org
>> To unsubscribe or change subscription options:
>> http://mail.python.org/mailman/listinfo/tutor
>>
> 
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
> 

Take out the = in the following line
          if item1 >= item2:
and it will sort like items together, which I think is what you 
originally wanted.

Also change:
 >             return flag
to:
 >    return flag
So that False gets returned if you don't make a swap.

This worked for me.  Thank you for the interesting thread!

-- 
Jeff

Jeff Johnson
jeff at dcsoftware.com
Phoenix Python User Group - sunpiggies at googlegroups.com