on sorting things

Peter Otten __peter__ at web.de
Wed Jan 29 11:32:40 EST 2020


Tony Flury via Python-list wrote:

> 
> On 20/12/2019 18:59, Peter Otten wrote:
>> Chris Angelico wrote:
>>
>>> On Sat, Dec 21, 2019 at 5:03 AM Peter Otten <__peter__ at web.de> wrote:
>>>> PS: If you are sorting files by size and checksum as part of a
>>>> deduplication effort consider using dict-s instead:
>>> Yeah, I'd agree if that's the purpose. But let's say the point is to
>>> have a guaranteed-stable ordering of files that are primarily to be
>>> sorted by file size - in order to ensure that two files are in the
>>> same order every time you refresh the view, they get sorted by their
>>> checksums.
>> One thing that struck me about Eli's example is that it features two key
>> functions rather than a complex comparison.
>>
>> If sort() would accept a sequence of key functions each function could be
>> used to sort slices that compare equal when using the previous key.
> 
> You don't need a sequence of key functions : the sort algorithm used in
> Python (tim-sort) is stable - which means if two items (A &B) are in a
> given order in the sequence before the sort starts, and A & B compare
> equal during the sort, then after the sort A & B retain their ordering.

Thank you for explaining that ;)
> 
> So if you want to sort by file size as the primary and then by checksum
> if file sizes are equal - you sort by checksum first, and then by file
> size: this guarantees that the items will always be in file size order -
> and if file sizes are equal then they will be ordered by checksum.
> 
> The rule to remember - is sort in the reverse order of criteria.

The idea behind the "sequence of key functions" suggestion is that you only 
calculate keys[n] where keys[:n] all compared equal.

Example: you have 1000 files and among those there are five pairs with equal 
size. Let's assume calculating the size is instantaneous whereas calculating 
the checksum takes one second. Then you spend 10 seconds on calculating the 
keys with my proposal versus 1000 seconds with a naive reliance on stable 
sorting.



More information about the Python-list mailing list