Surprising (for me) benchmark results...

Daniel Berlin dan at cgsoftware.com
Wed May 2 03:16:48 EDT 2001


On Wednesday, May 2, 2001, at 01:05 AM, John J. Lee wrote:

> On Tue, 1 May 2001, Daniel Berlin wrote:
>> On Wed, 2 May 2001, Rainer Deyke wrote:
>>> "Courageous" <jkraska1 at san.rr.com> wrote in message
>>> news:rrouet0f4un3h3eo8pq6nlk5idgkmhhhca at 4ax.com...
>>>> On Wed, 02 May 2001 01:11:50 GMT, "Rainer Deyke" 
>>>> <root at rainerdeyke.com>
>>> wrote:
>>>>> "Daniel Berlin" <dan at www.cgsoftware.com> wrote in message
>>>>> news:mailman.988762570.8591.python-list at python.org...
>>>>>> Actually, the sort is probably not the bottleneck.
>>>>>> As the file gets larger, the bottleneck becomes disk time, not sort
>>> time.
>>>>> Disk access is O(N); sorting is typically O(N log N).  Therefore as 
>>>>> the
> [...]
>>>> Disregarding the algorithm and this specific situation in question, 
>>>> you
>>> are
>>>> both correct; this is because while disk access may scale O(N), this
>>>> scaling is multiplied by a huge constant factor that can in many 
>>>> cases
>>>> dwarf the sorting algorithm, depending on the number of entries and
>>>> the like.
>
> The best answer, of course.  Mind you, I don't know whether the curves
> *ever* cross for feasible N.  Does anyone here know, through direct
> experience or calculation?  The other thing is that for small changes in
> N, disk access doesn't necessarily go as N (caches, that sort of 
> thing?),
> and maybe sorting doesn't go exactly as N ln N.
It's not just caches.

First, let's be clear on what Rainer did.
He said: disk accesses are O(n) (where n is the number of bytes); 
sorting is O(n log n) (where n is the number of items to sort); 
therefore, sorting is more expensive algorithmically.
This is clearly incorrect: you are comparing apples and oranges. The 
two n's don't represent the same thing.

The number of bytes is an absolute upper bound on the number of items 
to sort, and the two are only equal for single-byte items, in which 
case there are better sorting algorithms than a comparison sort (a 
counting sort is O(n)), and you could make disk accesses dominate 
again.

Typically, an item is multiple bytes. Let's say the average English 
word is 5 characters (I can't remember the actual figure). That makes 
disk accesses O(n) and sorting O((n/5) log (n/5)), where both n's now 
count bytes and are directly comparable. Once again, disk access 
clearly dominates for any feasible n, thanks to its huge constant 
factor. Until you are sorting a very large number of single-byte items 
with quicksort, disk accesses will dominate.

Now, this is even assuming disk accesses are O(n), which they aren't.

Disks are not based on deterministic technology; it's fuzzy 
technology. You can get probabilistic guarantees, like "it'll be O(n) 
99% of the time", but a blanket statement that disk access is O(n) is 
plain wrong, because disk drives aren't an exact technology. Most 
drives these days use PRML (Partial Response Maximum Likelihood). On 
disks, you get inter-symbol interference due to overlap of analog 
signal peaks. PRML uses digital filtering to handle this, then applies 
various other signal-processing algorithms, combined with 
maximum-likelihood data detection, to figure out the most likely 
sequence of bits it just read off the platter.

As you would imagine, this is some pretty hairy shit to analyze the 
complexity of if you want to claim it's deterministic, because reading 
the same bits twice, even from the same starting position, will quite 
possibly give you two slightly different analog signals to analyze. In 
reality, the average case is non-deterministic, because you end up 
having to include probability in the calculations. The worst case is 
much easier, and is deterministic, because there is no probability 
involved.
Suffice to say, it's not O(n) worst case, and it's not even really 
deterministic in pretty much any case; it just looks like it is, 
because your disk isn't writing exactly your data, just something it 
can pretty accurately determine to be your data when it reads it back. 
Error-correcting codes take care of the rest. If the drive uses 
Reed-Solomon, figuring out which symbols are wrong requires solving 
simultaneous equations with t unknowns, where 2t = n - k, k is the 
number of data symbols of s bits each, and n is the total number of 
symbols including the parity symbols (i.e. 2t + k = n).
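As a quick sanity check of that bookkeeping, here it is for the 
classic RS(255, 223) code over 8-bit symbols (my choice of example; a 
real drive's parameters will differ):

    # Example parameters, not any actual drive's: RS(255, 223)
    n, k = 255, 223     # total symbols and data symbols per block
    t = (n - k) // 2    # correctable symbol errors per block
    assert 2 * t + k == n
    print("RS(%d,%d) corrects up to %d bad symbols per block" % (n, k, t))
    # -> RS(255,223) corrects up to 16 bad symbols per block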
Bet you never realized it was really all this complex for a seemingly 
simple fucking disk access, no?
:)


>>> I was responding to this: "As the file gets larger, the bottleneck 
>>> becomes
>>> disk time, not sort time."  This is just plain incorrect (assuming 
>>> larger
>>> file = more entries, not longer entries).
>> No, it's not.
>> You still have to get your damn file off the disk.
>> He was timing a sorting program in its entirety, not the sort.
>> Therefore, the disk i/o time matters. And with it taking such a short 
>> time
>> for each program run, it could possibly matter a *whole lot*.
>>> If disk access dominates on large
>>> files, it dominates even more on small files.
>> Err, that doesn't make it not dominate on small files.
>
> And if it (disk access) dominates on small files, then it dominates less
> on large files, until eventually the sort dominates.  Mm?

> To repeat what you said again:
>
>>>>>> As the file gets larger, the bottleneck becomes disk time, not sort
>>> time.
>
Yeah, yeah, I said it wrong. I know.  The bottleneck is only sort time 
in the small-file case because the worst-case disk time never gets 
factored in, since it happens so infrequently in practice.  Disk time 
then becomes the bottleneck because even the 99th-percentile disk 
access time * n disk accesses starts being larger than 
(n/s) log (n/s), where s is the size of the items.

You can't even say with certainty that disk time doesn't dominate in 
the small case; it only doesn't some percentage of the time.
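For what it's worth, that "some percentage of the time" point is easy 
to model as an expected value. A hedged sketch; the fast/slow times 
and the 1% figure are invented for illustration, not drive specs:

    FAST = 1e-4    # assumed typical access time in seconds (invented)
    SLOW = 3e-2    # assumed worst case: retries, remapping (invented)
    P_SLOW = 0.01  # assumed probability of the slow path (invented)

    def expected_disk_time(n_accesses):
        # Still linear in n, but the constant is an expectation that
        # mixes the common case with the rare worst case.
        per_access = (1 - P_SLOW) * FAST + P_SLOW * SLOW
        return n_accesses * per_access

    print(expected_disk_time(10**6))  # ~399 seconds with these numbers

With these made-up numbers the 1% tail contributes about three 
quarters of the expected time, which is why ignoring it makes disk 
access look so much cheaper than it is.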

Now maybe we are all starting to realize why Turing machines have 
constant-time memory access.
:)

> (you can't get any more smug and annoying than a good 'Mm?', can you? :)
>
>
> John
>
> --
> http://mail.python.org/mailman/listinfo/python-list



