Surprising (for me) benchmark results...

John J. Lee phrxy at csv.warwick.ac.uk
Wed May 2 19:25:26 EDT 2001


On Wed, 2 May 2001, Daniel Berln wrote:
[...]
> First, let's be clear on what Rainer did.
> He said: Disk access are O(n). (Where n is the number of bytes). Sorting
> is O(n log n) (Where n is the number of items to sort). Therefore,
> sorting is more expensive algorithmically.

Yes.

> This is clearly incorrect, you are comparing apples and oranges. The n's
> aren't representing the same things.

They represent 'number of things to sort' in both cases, unless we're
talking at cross-purposes.

> The number of bytes is the absolute upper bound on the number of items
> to sort, and would only be the case for single byte items, in which
> case, there are better sorting algorithms, and you could make disk
> accesses dominate again.

There are O(N) sorting algorithms??  I thought that was restricted to
quantum computation.

> Typically, an item is multiple bytes. Let's say the average english word
> is 5 characters ( I can't remember what the actual figure is).
> This would mean Disk accesses are O(n), Sorting is O(n/5 log n/5) where
> both n's are directly comparable. Once again, disk access clearly
> dominates.

O(N/5 log N/5) = O(N log N)

The constant factors etc. only affect the *point* at which disk access
becomes unimportant, not *whether* it does (assuming you can get to big
enough N without running out of memory).

> Until you are sorting a very large number of single byte items with
> quicksort, disk accesses will dominate.

Yes, that may well be true.  I don't know.

> Now, this is even assuming disk accesses are O(n), which they aren't.

They are O(N), but that doesn't mean that the time it takes is simply
proportional to N, does it?  Or that there won't be some variation about
the O(N) behaiviour for small changes in N?

I think we're just not using the same terminology.  Rainer was just saying
that, for large enough N, disc access will not dominate (as long as
everything fits in memory).  This is true.  I think (correct me if I'm
wrong) you're saying that we won't get to that point before we run out of
memory.  Or perhaps you're just saying that for some small N and small
change in N, the trend goes in the other direction, and in fact as N
increases disk access goes from being unimportant to being important.
These statements may or may not be true, I don't know.  It seems unlikely
that O(N) could be too far wrong for if the storage required is close to
the amount of memory in your machine and N isn't too small, at least.

> Disk's are not based on technology that is determinstic. It's a fuzzy
[...detail snipped...]

I think everyone here realises this.


John





More information about the Python-list mailing list