[2.5.1.1/dictionary] Change sorting order?

Fri Jan 22 13:40:26 EST 2010

Steven D'Aprano wrote:
> On Fri, 22 Jan 2010 09:49:32 -0500, Dave Angel wrote:
>
>   
>> Seems to me the other solutions I've seen so far are more complex than
>> needed.  I figure you either want an unordered list, in which case you
>> could use random.shuffle(), or you want a list that's sorted, but starts
>> somewhere in the middle, at an arbitrary place, goes to the end, and
>> wraps back to the beginning.  So use random.randint() to choose an index
>> within the list, and concatenate two slices of the list, based on that
>> index in reverse order.
>>
>> And don't bother generating it each minute, but simply generate it each
>> time you need it.
>>     
>
> According to the OP's stated requirements, he needs it every minute.
>
>   
I read it as he wants viewers of the page that are more than a minute 
apart to see a different order.
> Personally, from a usability perspective, I think a refresh of a 
> (potentially huge) list of users every minute would be horrible. I think 
> the whole design needs to be rethought, but that's not my decision.
>
>
>   
Certainly if there are more than 100 users or so, the idea of presenting 
a web page with a single list of users is unwieldy.  So I assumed this 
problem did not have to scale.  Once the list goes on multiple pages, 
entirely different algorithms are appropriate, along with the display 
mechanisms to support them.
>> I doubt if the cost of generating it is much
>> different than the cost of checking the time.
>>     
>
> Checking the time is quite fast -- it's a simple system call. Slicing and 
> copying a list containing potentially tens or hundreds of thousands of 
> names is anything but cheap!
>
>   
>>>> from timeit import Timer
>>>> t1 = Timer('x = time.time()', 'import time')
>>>> t2 = Timer('y = x[:1000] + x[1000:]', 'x = [None]*100000')
>>>>
>>>> t1.timeit(1000)
>>>>         
> 0.0025529861450195312
>   
>>>> t2.timeit(1000)
>>>>         
> 3.0900199413299561
>
> Slicing is more than 1000 times slower than checking the time. If you 
> think I'm being unfair, that the OP's application will never have 100,000 
> names, try it with 1000 names:
>
>   
>>>> t3 = Timer('y = x[:1000] + x[1000:]', 'x = [None]*1000')
>>>> t3.timeit(1000)
>>>>         
> 0.041487932205200195
>
> That's 20 times slower than checking the time.
>
> That's the advantage of my suggestion: the only slicing done is the tiny 
> master list, which means copying 27 pointers. Easy and fast, and it 
> doesn't matter whether you have one user or one million, that operation 
> will take exactly the same amount of time. Your suggestion to keep all 
> the users in one giant list, and slice it repeatedly, scales terribly.
>
>
>   
>> I guess there's a third possibility, that you don't want some of the G
>> names at the beginning, and some at the end.  In that case, I'd generate
>> the random index as above, then increment it till the first character of
>> the item changes.  Use that index as your split point.
>>     
>
> Another solution that doesn't scale terrible well. (Although not as bad 
> as the previous one.)
>
> And what's with the random index thing? Why are you throwing away 
> perfectly good information every time you want to shift letters? Your 
> solution is something like this:
>
> You have a book opened at the start of Chapter 11, and you now want to 
> jump to the start of Chapter 12.
>
> (1) Open the book at a random page.
> (2) Is it the start of Chapter 12?
>   
Wrong.  We're checking whether it's the start of whatever the next 
chapter is, so if we randomly jumped into chapter 5, we'd be going a 
page at a time till chapter 6.  After I thought about it, I realized we 
should be going backwards, not forward, but the idea is the same.  We're 
only walking through a half chapter, typically.
>     If yes, you're done.
>     If no, turn to the next page. If you reach the end of 
>     the book, start again at the beginning.
> (3) Go to step 2.
>
> There are search techniques that start with a random index, but they 
> don't advance one position at a time! (E.g. binary search, or hunt-and-
> bisect search, or similar.) But why search each time when you can keep an 
> index to the start of each chapter and jump straight there in constant 
> time?
>
>   
First, the time that will be spent turning this list into a web page 
will far exceed any of these numbers.

And if we assume he's using CGI, then the time spent reloading and 
unmarshaling the list will exceed by even more.

Next, when it comes to an actual implementation, I used bisect(), which 
is O(logn).

And finally, it was more important to get the concepts together, as the 
spec wasn't well-thought out, than it was to get some optimal solution.  
That's why I used slice instead of doing the index-len() trick.  In real 
code, he'll be looping over the list, calling some logic to add <p> and 
</p> for example.  By running that loop from index-len() to index, it 
can operate on the original list.

In my opinion, the only "fair" answer is to choose the starting point 
randomly, or sequentially, but without regard to starting letter.  And 
random has the distinct advantage of not needing to store state between 
CGI invocations.

The rest are premature optimizations, in my opinion.   But once a final 
spec is decided, if the letters wanted to be equally weighted, then a 
dictionary of sorted lists might very well be the best storage 
mechanism.  And once each starting letter gets its own web page, it also 
makes sense.

DaveA