Abuse of Big Oh notation

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Aug 21 05:53:45 EDT 2012


On Mon, 20 Aug 2012 11:12:22 -0600, Ian Kelly wrote:

> On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico <rosuav at gmail.com>
> wrote:
>> On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email at nospam.invalid>
>> wrote:
>>> Analogy: how big a box is required to hold a pair of shoes?  In a
>>> purely theoretical sense we might say O(S) where S is the shoe size,
>>> treating shoe size as an arbitrary independent variable.  But in the
>>> real world, shoe size is controlled by the size of human feet, which
>>> is bounded by a constant for biological reasons.  You don't have to
>>> consider shoes the size of Jupiter.  So it is O(1).
>>
>> By that argument, everything is amortized O(1), because there's a limit
>> on every variable. You can't possibly be working with a data set
>> greater than the total sum of storage space in the entire world. That
>> still doesn't mean that bubble sort and heap sort are equivalently
>> efficient.
> 
> The difference between the two is that the former is bounded by a
> constant that is fundamental to the algorithm at hand, whereas the
> latter is bounded only by available resources.  By any practical
> consideration the latter must be considered a variable, but the former
> need not be.
> 
> Paul discusses above the asymptotic growth of a variable as O(S) where S
> is shoe size, but really this measure makes no sense in the first place.
>  The question that this notation seeks to answer is, "What is the big-Oh
> behaviour of this variable as shoe size increases indefinitely (i.e. to
> infinity)?" and the answer in this case is "Shoe size does not increase
> indefinitely"; the question is invalid.
> 
> A more logical question might be, "How much material do I need to
> construct N shoes of size S?"  The answer to this question would
> presumably be some constant factor of N * S**2, which is O(N * S**2).
> Although N can be assumed to vary freely (up to nonsensical quantities
> like the mass of the entire universe), S is clearly bounded by the
> constraints of actual shoes, so we can safely treat S as a constant and
> call it O(N).

Why the hell are we talking about shoe sizes?

Let's not lose sight of the actual question at hand: how expensive is it 
to fetch a character at some arbitrary index in a string?

With fixed-width characters, average time is O(1), best-case time is O(1) 
and worst-case time is O(1).

With non-fixed width characters, average time and worst-case time are 
both O(N), and best-case ("give me the character at index 0") is O(1). 
But that's a degenerate case that gives us absolutely no insight into the 
cost of the operation. It's like the observation that "sorting a list of 
one item takes constant time, regardless of the algorithm". Um, yes, it 
does. So what?
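
To make the difference concrete, here's a toy sketch (Python 3; this is 
not how CPython actually stores strings, and the helper names are my own 
invention):

def index_fixed_width(buf, i, width=4):
    # Fixed-width storage (here UTF-32): the i-th character starts
    # at byte i*width, so the lookup is a single slice -- O(1).
    start = i * width
    return buf[start:start + width].decode('utf-32-le')

def index_variable_width(buf, i):
    # Variable-width storage (here UTF-8): there is no way to know
    # where the i-th character starts without scanning from the
    # front -- O(N).  Assumes 0 <= i < number of characters.
    count = 0
    start = 0
    for pos in range(len(buf)):
        # Bytes of the form 10xxxxxx are UTF-8 continuation bytes;
        # every other byte begins a new character.
        if buf[pos] & 0xC0 != 0x80:
            if count == i:
                start = pos
            elif count == i + 1:
                return buf[start:pos].decode('utf-8')
            count += 1
    return buf[start:].decode('utf-8')

s = "día"
print(index_fixed_width(s.encode('utf-32-le'), 1))  # 'í', one slice
print(index_variable_width(s.encode('utf-8'), 1))   # 'í', after a scan

The fixed-width lookup costs the same no matter what i is; the UTF-8 
lookup pays for a scan proportional to the index.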

Invented constraints like "people only care about indexes really close to 
zero, or to the end of the string" effectively mistake the best possible 
case for the average case. In real life, people do care about random 
access to strings, and the average case matters more than the best case. 
That's why strings are typically implemented as arrays of fixed-width 
characters rather than linked lists, and why Haskell, whose native String 
type *is* a linked list of characters, explicitly warns you of that fact 
and that your string-handling code is likely to be slow.

Of course, you can trade space and complexity for time, e.g. ropes, or 
cached pointers to character locations in the hope that indexes will be 
clustered rather than scattered randomly. These more complex 
implementations typically use at least as much memory as, and often 
perform no better than, a dead simple implementation that uses a plain 
array of fixed-width characters. Most harmfully, it means that a simple 
indexing operation may be fast on average but occasionally degenerate to 
slow. The only thing worse than "This program is slow" is "This program 
is *occasionally* slow, and I can't predict when".
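
For illustration, here's roughly what the cached-pointer idea might look 
like over UTF-8 storage. Everything below is hypothetical, invented for 
the example (Python 3):

class CachedIndexString:
    # Remembers where the last lookup ended, so that nearby
    # ("clustered") indexes are cheap.  A purely illustrative toy.
    def __init__(self, text):
        self._buf = text.encode('utf-8')
        self._char_pos = 0  # character index of the cached position
        self._byte_pos = 0  # byte offset of that character

    def _is_lead(self, b):
        # UTF-8 continuation bytes look like 10xxxxxx.
        return b & 0xC0 != 0x80

    def __getitem__(self, i):
        # Assumes 0 <= i < number of characters.
        if i < self._char_pos:
            # Worst case: jumping backwards means a full rescan.
            self._char_pos, self._byte_pos = 0, 0
        while self._char_pos < i:
            self._byte_pos += 1
            while (self._byte_pos < len(self._buf)
                   and not self._is_lead(self._buf[self._byte_pos])):
                self._byte_pos += 1
            self._char_pos += 1
        end = self._byte_pos + 1
        while end < len(self._buf) and not self._is_lead(self._buf[end]):
            end += 1
        return self._buf[self._byte_pos:end].decode('utf-8')

s = CachedIndexString("naïve café")
print(s[3])  # 'v': scans from the front the first time
print(s[4])  # 'e': one step from the cached position
print(s[0])  # 'n': going backwards throws the cache away

Lookups near the cached position are fast; a lookup before it degrades 
to a rescan from the start -- fast on average, occasionally slow, 
exactly as described above.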

Not surprisingly, almost all programming languages in common use take 
arrays of fixed-width characters as their native string type. Python 3.3 
will do the same thing, with the memory optimization that the width is 
chosen per string, from 1, 2, or 4 bytes according to the widest 
character the string contains, rather than being fixed at 2 or 4 bytes 
for every string when the Python interpreter is compiled.
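
You can see the per-string width choice directly on a CPython 3.3 build 
with sys.getsizeof (the byte counts are approximate and vary by platform 
and version):

import sys

samples = [
    'a' * 1000,           # pure ASCII: 1 byte per character
    '\u00e9' * 1000,      # é fits in Latin-1: still 1 byte each
    '\u0100' * 1000,      # outside Latin-1: 2 bytes per character
    '\U00010000' * 1000,  # outside the BMP: 4 bytes per character
]
for s in samples:
    print('%d chars -> %d bytes' % (len(s), sys.getsizeof(s)))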

This is a big step forward, achieved with a pretty simple algorithm, and 
it should go a long way toward encouraging the use of Unicode by taking 
much of the sting out of the perennial complaint that "Unicode wastes 
space". Not so wasteful any more.



-- 
Steven


