Surprising (for me) benchmark results...

Kevin Jacobs jacobs at darwin.epbi.cwru.edu
Wed May 2 08:05:07 EDT 2001


Daniel Berlin <dan at cgsoftware.com> wrote:
> Now, this is even assuming disk accesses are O(n), which they aren't.

> Disks are not based on technology that is deterministic. It's a fuzzy 
> technology. You can get probabilistic guarantees, like it'll be 99% of 
> the time O(n), but a blanket statement that disk access is O(n) is plain 
> wrong.

Sorry, but your analysis is faulty.  Let's assume that the time it takes to
read n blocks of data is t(n), and that we can bound it by the linear
function

  t(n) < a*n + b, when n > c

for some arbitrarily large, but constant, a, b and c.  Let's say that a is
100 seconds, b is 10 seconds, and c is 100 blocks.  This means that, in
amortized terms, each block is read within 100 seconds, plus a constant
overhead of 10 seconds, whenever more than 100 blocks are to be read.

This sounds pretty reasonable, 'eh?  

Well, that is what O(n) means.  I'd say it is safe to say that disk access
is O(n), in the absence of pathological operating system conditions.  If
not, a lot of database researchers who make good livings under this
assumption will want to have a word with you.
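
If you want to see that concretely, here is a rough Python sketch (the
per-block times and the constants are made up purely for illustration) of
checking the total read time against the a*n + b bound:

  import random

  a, b, c = 100.0, 10.0, 100    # seconds per block, constant overhead, minimum n

  def read_block():
      # stand-in for a real disk read: usually fast, occasionally slow
      return 8.0 if random.random() < 0.95 else 80.0

  for n in (200, 1000, 5000):
      total = 5.0 + sum(read_block() for _ in range(n))   # 5.0 = fixed setup cost
      assert total < a * n + b                             # i.e. t(n) < a*n + b for n > c
      print(n, total / n)                                  # amortized seconds per block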


> This is because disk drives ain't an exact technology.  Most 
> drives these days use PRML, Partial Response Maximum Likelihood.

Sure -- no biggie -- integrate the probability function times the expected
read time for each read attempt, add in some processing and transfer
overhead, and you'll get an expected read time.  Given enough blocks, this
number will follow some predictable distribution from which a reasonable
upper bound may be obtained.  Inflate that upper bound by a nice constant
factor ('a' from above) and you're home free.  Sure, it will be wrong a
vanishingly small fraction of the time, but under most reasonable
assumptions this can be amortized in with the other blocks being read.  So
when dealing with big-O notation, the small amount of non-determinism is the
least of our worries.
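
Here's a back-of-the-envelope Python sketch of that calculation (the success
probability and the timings are invented); the number of attempts per block
is geometric, so its mean is 1/p:

  p          = 0.99    # probability a single read attempt succeeds (made up)
  t_attempt  = 8.0     # seconds per attempt: seek + rotation + transfer (made up)
  t_overhead = 0.5     # per-block processing/transfer overhead (made up)

  expected_attempts  = 1.0 / p                        # mean of a geometric distribution
  expected_per_block = expected_attempts * t_attempt + t_overhead

  a = 2.0 * expected_per_block    # inflate by a comfortable constant factor
  print(expected_per_block, a)    # expected cost per block vs. the 'a' from above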

(btw, if you don't buy my argument, there is a very well-developed
theoretical basis for the probabilistic analysis of algorithms.  If you are
interested, I can supply many wonderful references.)

> In reality, when you get down to it, it's non-deterministic in the average
> case because you end up having to include probability in the calculations.
> Worst case is much easier, and is deterministic, because there is no
> probability involved.

The analysis of many algorithms involves integrating over a probability
distribution function to find the average case time complexity.  While the
algebra may get hairy, there is nothing wrong with doing so.  Anyway, big-O
notation _is_ a measure of worst-case behavior for a given distribution of
inputs.  Don't be fooled just because it's used in the context of putting an upper
bound on the "worst best case", the "worst average case" and the "worst
worst case" behaviors of an algorithm.
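
As a toy illustration of averaging over an input distribution, take linear
search with the key equally likely to be at any of the n positions; the
average case falls straight out of the sum, and it is still O(n):

  def average_comparisons(n):
      # position i (1-based) costs i comparisons and occurs with probability 1/n
      return sum(i * (1.0 / n) for i in range(1, n + 1))

  for n in (10, 100, 1000):
      print(n, average_comparisons(n), (n + 1) / 2.0)   # both equal (n+1)/2 -- still O(n)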

For example, we can say that if the worst case behavior of an algorithm is
O(n) then its average case and best case behaviors must also be O(n).

Why, you ask?

Well, if the worst-case time for an algorithm is t_w(n) < a*n + c, and the
best-case time t_b(n) is less than or equal to t_w(n), then it must also be
that t_b(n) < a*n + c.  Thus the best-case time complexity is also O(n).
This does not say that O(n) is _the best_ upper bound -- it's just saying
that it _is_ an upper bound.

> If they are using reed solomon, figuring out which bits are wrong requires
> solving simultaneous equations with t unknowns (where 2t=n-k, k is the
> number of data symbols of s bits each, and n is the number of symbols,
> including the new parity symbols (IE 2t+k = n)). 

This is just a smoke-and-mirrors argument.  All of the variables you've
defined are effectively constants (the n here isn't the same n as before).
They have well-defined upper bounds fixed by the disk geometry!  So even if
computing Reed-Solomon codes were O(m^10 * n^10), m and n are constants,
so they inflate the per-block cost by at most a constant factor and do not
change the big-O bound at all.
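
Here's a toy Python sketch of that point (the geometry numbers and the
"decode cost" formula are placeholders, not a real Reed-Solomon decoder):
the per-sector work depends only on fixed parameters of the on-disk format,
so reading n_blocks sectors is still linear in n_blocks:

  n_sym, k_sym = 255, 223           # a common RS codeword geometry, used only for illustration
  t_corr = (n_sym - k_sym) // 2     # correctable symbols per codeword

  decode_cost = n_sym * t_corr ** 2   # placeholder per-sector decode cost: a constant
  read_cost   = 8.0                   # placeholder raw read cost per sector

  def total_cost(n_blocks):
      # the per-sector work is a constant, so the total is linear in n_blocks
      return n_blocks * (read_cost + decode_cost)

  for n_blocks in (100, 1000, 10000):
      print(n_blocks, total_cost(n_blocks) / n_blocks)   # flat cost per block -- O(n) overall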

Anyhow, just thought I'd inject a bit 'o computer science fun into your
life.

Theoretically,
-Kevin

-- 
----------->  Kevin Jacobs  <-----------|------->  (216) 986-0710  <--------
Informatics Consultant                  | Department of Epidemiology
Primary mail:   jacobs at theopalgroup.com |   & Biostatistics
Alternate mail: jacobs at darwin.cwru.edu  | Case Western Reserve University
----------------------------------------------------------------------------


