[Python-ideas] Sorted lists

Nathaniel Smith njs at pobox.com
Mon Apr 8 05:55:54 EDT 2019


On Mon, Apr 8, 2019, 02:09 Steven D'Aprano <steve at pearwood.info> wrote:

> On Sun, Apr 07, 2019 at 08:26:24PM -0700, Nathaniel Smith wrote:
> > On Sun, Apr 7, 2019 at 7:37 PM Steven D'Aprano <steve at pearwood.info>
> wrote:
> > > There are quite a few important algorithms which require lists to be
> > > sorted. For example, the bisect module, and for statistics median and
> > > other quantiles.
> >
> > But this flag doesn't affect those modules, right? 'bisect' already
> > requires the user to ensure that the list is sorted appropriately
>
> Naturally the bisect and statistics modules (or any other that requires
> sorting) won't change to inspect this flag by magic, the code will
> require modification.


Right, by "doesn't affect" I meant "cannot get any benefit, even if their
code is modified".

Possibly the maintainer of bisect may decide that its not worth the
> change. But for the statistics module, I would certainly change the
> implementation of median() to look something vaguely like this:
>
>     # was
>     data = sorted(data)  # may be expensive if data is large
>
>     # may become
>     if not (isinstance(data, list) and data.__issorted__):
>         data = sorted(data)


>
> statistics is soon to grow a quantiles() function, but the thing with
> quantiles is often you want to get a bunch of them:
>
>     # This only makes sense if data is a sequence (list)
>     # not an iterator.
>     quartiles = quantiles(data, n=4)
>     quintiles = quantiles(data, n=5)
>     deciles = quantiles(data, n=10)
>     percentiles = quantiles(data, n=100)
>

If only we had some kind of API that could compute multiple quantiles at
the same time...


>
> That's four calls to sorted(). The caller could drop that down to one:
>
>     data.sort()
>     quartiles = ... etc
>
>
> Now before anyone else mentions it, we could give the function a
> "dont_sort" argument, or "already_sorted" if you prefer, but I dislike
> that kind of constant-bool parameter and would prefer to avoid it.
>
>
> > and this bit:
> >
> > > The flag doesn't guarantee that the list is sorted the way you want
> > > (e.g. biggest to smallest, by some key, etc) only that it has been
> > > sorted. Its up to the user to ensure they sort it the right way:
> >
> > ...seems to mean that the 'statistics' module can't use this flag either.
>
> "Consenting adults" applies. If you want to pass an unsorted list to the
> functions, but pretend that its sorted, then on your own head be it.
> There's no real difference between these two hypothetical scenarios:
>
>     data = [1, 4, 2, 0, 5, 3]
>     garbage = median(data, already_sorted=True)
>
> versus:
>
>     data = [1, 4, 2, 0, 5, 3]
>     data.__issorted__ = True
>     garbage = median(data)
>
>
> I'm perfectly comfortable with allowing the caller to lie if they want.
> Its their own foot they're shooting.
>

An already_sorted=True argument would be an explicit opt in, and consenting
adults would apply. But your message was very explicit that __issorted__
can be set implicitly, though. For example, this would give garbage results:

# implicitly sets the sorted flag
data.sort()
# preserves the flag, because hey it's sorted by *some* key
data.reverse()
statistics.median(data)

You can't use this in statistics.median because it would break
compatibility. Also, isn't the whole point of 'statistics' to be the
simple, reliable module for folks who aren't that worried about speed? This
seems like a massive footgun.


> (I wouldn't be so blasé about this if it were a function written in C
> that could segfault if the list wasn't sorted.)
>

Silently giving the wrong answer is way worse than a segfault.

> It doesn't seem very likely to me that the savings from this flag
> > could outweigh the extra overhead it introduces, just because list
> > operations are *so* common in Python. If you want to push this
> > forward, the thing I'd most like to see is some kind of measurements
> > to demonstrate that average programs will benefit.
>
> I'm not sure that the average program uses sort *at all*, so a better
> set of questions are:
>
> - how much will this hurt the average program?
>   (my gut feeling is "very little", but we'd need benchmarks to know)
>
> - are there any other use-cases for sorted data that could benefit?
>
> - how much will they benefit?
>
> Let's say, for the sake of the argument that this proposal makes the
> average program 0.01% slower, but helps sorting-heavy programs be 2%
> faster when dealing with large lists, then I think that might be a win.
>

Obviously these are made up numbers, but if they were real then for it to
be a net win you would still need at least 1 in 200 programs to be "sorting
heavy" in a way that could benefit from this flag, and I don't believe
that's true.

-n
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20190408/bf797c1e/attachment-0001.html>


More information about the Python-ideas mailing list