A strange statement in the bisect documentation?

Fri Mar 6 21:24:11 EST 2015

Dmitry Chichkov wrote:

> I was looking over documentation of the bisect module and encountered the
> following very strange statement there:
> 
> From https://docs.python.org/2/library/bisect.html
> 
> ...it does not make sense for the bisect() functions to have key or
> reversed arguments because that would lead to an inefficient design
> (successive calls to bisect functions would not "remember" all of the
> previous key lookups).
> 
> Instead, it is better to search a list of precomputed keys to find the
> index of the record in question...
> 
> 
> Is that right that the documentation encourages to use O(N) algorithm [by
> making a copy of all keys] instead of using O(log(N)) bisect with
> kay=attrgetter?  And claims that O(N) is more efficient than O(log(N))?

Apparently :-)

The documentation may not be completely clear, but what it is arguing is
this:

If you are making repeated bisect calls, then using a key function is
inefficient because the key gets lost after each call and has to be
recalculated over and over again. A concrete example:

data = [i/100 for i in range(1, 7000001, 7)]
data.sort(key=str)

for x in many_x_values:
    bisect.insort(data, x, key=str)  # Pretend this works.

The first time you call insort, the key function (str in this case) will be
called O(log N) times. The second time you call insort, the key function
must be called again, even for the same data points, because the keys
str(x) are thrown away and lost after each call to insort.

After M calls to insort, there will have been on average M*(log(N)+1) calls
to the key function. (The +1 is because the x gets str'ed as well, and
there are M loops.)

As an alternative, suppose we do this:

data = [i/100 for i in range(1, 7000001, 7)]
data.sort(key=str)
keyed_data = [str(x) for x in data]

for x in many_x_values:
    key = str(x)
    pos = bisect.bisect(keyed_data, key)
    keyed_data.insert(pos, key)
    data.insert(pos, x)

This costs N calls to str in preparing the keyed_data list, plus M calls to
str in the loop. The documentation suggests that we consider this will be
cheaper in the long run, in other words:

N + M < M*(log(N) + 1)

N + M < M + M*log(N)

That's true whenever N < M*log(N). 

The documentation assumes we should care about large N and large M. As they
say, "everything is fast for small N": if you are doing three bisections on
a list with ten items, then *who cares* how you do it, it's going to be
quite fast. So let's say you have a list of 10000 items, that is N = 10000.
Then it is faster to build a second keyed_list when:

10000 < M*log(10000) = M*13  # approximately

i.e. M > 770 give or take. That's a fairly small number, and it's only about
8% of the size of N.

So the general perspective is:

If you have a list of N items, and you call bisect more than about (10% of
N) times, it will likely be faster to pre-prepare a keyed list than call a
key function repeatedly for each call to bisect.

If you call bisect less than (10% of N) times, then it doesn't matter,
because that's comparatively a small number and will be fast enough either
way.

Is the documentation correct? I don't know. I don't have an opinion on this.
I recommend that you fork the bisect module, call it mybisect, add a key
function parameter, and compare the speed of the two versions, with or
without a key function. Remember:

- Recent versions of bisect have a C accelerated version. Make sure you
disable that, so you are comparing the speed of Python versus Python not
Python versus C.

- Nobody will care about small lists. I suggest that you start with at least
ten thousand items, and move up to ten billion or so. For each N in (10**4,
10**5, ... 10**10) find out how many calls to insort or bisect does it take
before the key function version becomes slower. Or possibly faster.

- Is there are difference between fast key functions like str, and slow ones
that have to do a lot of processing?

-- 
Steven