Bug in timsort!?

Chris Kaynor ckaynor at zindagigames.com
Tue Feb 24 19:38:48 EST 2015


On Tue, Feb 24, 2015 at 4:07 PM, Chris Angelico <rosuav at gmail.com> wrote:

> On Wed, Feb 25, 2015 at 10:50 AM, Skip Montanaro
> <skip.montanaro at gmail.com> wrote:
> > Even if/when we get to the point where machines can hold an array of
> > 2**49 elements, I suspect people won't be using straight Python to
> > wrangle them.
>
> Looking just at CPython, what is the absolute minimum storage space
> for a massive list like that? My guess is that a 64-bit CPython might
> get as good as eight bytes per element; playing around with smaller
> figures (up to circa a million elements) shows it ranging from 8 to 9
> bytes per element, bottoming out at just a smidge over 8, presumably
> at the moments when there's no slack space (there's a fixed overhead,
> which gets diminishingly significant). So an array of 2**49 elements
> would take 2**52 bytes (4 petabytes) of storage - addressable, to be
> sure, but an awful lot to be sorting.
>
> Would it be sufficient to stick a comment into the source saying "This
> may have problems with lists in excess of 2**49 elements", and leave
> it until that's actually likely to happen?
>

I would agree that it will be quite a while as it stands before the bug
appears, so documenting it MAY be appropriate, as it is likely to be far
enough in the future that a number of unforeseen things may block the bug
from being an issue (TimSort may be replaced by something better, CPython
could die, etc).

That said, from what I understand, their proposed fix would be reasonably
easy to use, and have minimal cost, so it may be worthwhile to use just
because the issue is likely to be in use for quite a while before somebody
remembers the issue exists. Once somebody does hit it, it may take a while
to realize that TimSort is at fault, as it will only occur on some data
sets (my understanding is that the elements must be in certain
configurations with at-least the minimum number), so it will likely appear
as just a random crash occurring.

If it were fully up to me, I would do one of the following, in order of
preference:
1) Implement and test the fix provided on the blog. This requires the most
work due to licensing, reviewing, and testing.
2) Add a noisy warning/error to the code when sorting lists large enough to
potentially hit the error (probably 2**48 to leave some wiggle room). This
is much easier to do, but merely pushes the can down the road to be dealt
with later. It does still make it obvious what is going long when it
finally breaks.
3) Add a comment explaining the issue (likely linking back to the blog).
While by far the easiest "solution", again, this merely pushes the can down
the road, and it may be very non-obvious what went wrong when it breaks,
despite the comment.

Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20150224/4747e752/attachment.html>


More information about the Python-list mailing list