A strange statement in the bisect documentation?

Dmitry Chichkov dchichkov at gmail.com
Mon Mar 9 14:03:53 EDT 2015


Steven,

I somewhat agree with you on the general analysis, yet I'm not quite sure about your analysis of the repeated-bisect-call code. In particular, in the sample that you've given:

data = [i/100 for i in range(1, 7000001, 7)]
data.sort(key=str)
keyed_data = [str(x) for x in data]
 
for x in many_x_values:
    key = str(x)
    pos = bisect.bisect(keyed_data, key)
    keyed_data.insert(pos, key)
    data.insert(pos, x)


As the Python documentation says, "doing inserts or pops from the beginning of a list is slow (because all of the other elements have to be shifted by one)", so the complexity here would be O(M*N) [+ O(M*log(N))].

So this particular example looks like a micro-optimization around bisect() that loses sight of the big picture: inserts into a list are O(N) and in general more expensive than O(log(N)) bisect calls, doing two inserts (into keyed_data and data) is worse than doing one, and storing and maintaining an extra list of keys is relatively expensive, and sometimes plain impossible.
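
A quick way to sanity-check the relative costs (only a sketch, roughly matching your sample sizes; the absolute numbers are machine-dependent):

import bisect
import timeit

# For ~1,000,000 keys: the O(log N) bisect touches ~20 elements,
# while each O(N) insert in the middle has to shift ~500,000 of them.
keyed_data = sorted(str(i / 100) for i in range(1, 7000001, 7))
key = str(3.14)

print("bisect:", timeit.timeit(lambda: bisect.bisect(keyed_data, key), number=1000))
print("insert:", timeit.timeit(lambda: keyed_data.insert(len(keyed_data) // 2, key),
                               number=1000))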

The following statement, which agrees with the documentation, also doesn't seem very sound: "If you call bisect less than (10% of N) times, then it doesn't matter, because that's comparatively a small number and will be fast enough either way."

If N is large (say 10000000000 records), the speed may still matter a lot even if you need to call bisect only once or twice. Take the example of a container over a timestamped 1TB log file, where you need to navigate to a particular record. It would be a very slow operation indeed to create a copy of all the keys, as the documentation suggests. Yet it would be very fast to just bisect to the right element, taking only about log2(10000000000) ~ 34 reads.
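
For illustration, here is a rough sketch of that kind of lookup (the path, the parse_ts callable and the newline-terminated record format are assumptions of mine, and edge cases such as the very first record are glossed over):

import os

def seek_to_timestamp(path, target_ts, parse_ts):
    # Binary-search a file of newline-terminated records sorted by timestamp.
    # Only O(log(filesize)) seeks and short reads are needed; no in-memory
    # copy of the keys is ever built.
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                 # skip the (possibly partial) record at mid
            line = f.readline()
            if not line or parse_ts(line) >= target_ts:
                hi = mid                 # first record >= target is at or before mid
            else:
                lo = mid + 1             # it lies strictly after mid
        return lo                        # approximate byte offset of that record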

So to me, it looks like the documentation's argument that "it does not make sense for the bisect() functions to have key or reversed arguments because that would lead to an inefficient design" is incorrect in general, because it suggests doing an O(N) operation instead of an O(log(N)) one in order to optimize a hypothetical scenario.

To give examples: note how it led you to an inefficient design in your sample code, with *two* .insert() calls rather than just one. And I can easily see people writing something like .bisect([el.time for el in container]) just because they don't expect the container to ever become large.
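
To show what I mean, here is a minimal pure-Python sketch of a bisect that takes a key function (bisect_right_key is just a name I made up; the stdlib bisect being discussed has no such parameter):

def bisect_right_key(a, x, key, lo=0, hi=None):
    # Like bisect.bisect_right, but compares x against key(a[i]);
    # a must already be sorted by that same key.
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < key(a[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

With that, the loop in your sample needs no keyed_data list and only one insert per element:

for x in many_x_values:
    pos = bisect_right_key(data, str(x), key=str)
    data.insert(pos, x)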


On Friday, March 6, 2015 at 6:24:24 PM UTC-8, Steven D'Aprano wrote:
> Dmitry Chichkov wrote:
> 
> > I was looking over documentation of the bisect module and encountered the
> > following very strange statement there:
> > 
> > From https://docs.python.org/2/library/bisect.html
> > 
> > ...it does not make sense for the bisect() functions to have key or
> > reversed arguments because that would lead to an inefficient design
> > (successive calls to bisect functions would not "remember" all of the
> > previous key lookups).
> > 
> > Instead, it is better to search a list of precomputed keys to find the
> > index of the record in question...
> > 
> > 
> > Is that right that the documentation encourages to use O(N) algorithm [by
> > making a copy of all keys] instead of using O(log(N)) bisect with
> > key=attrgetter?  And claims that O(N) is more efficient than O(log(N))?
> 
> Apparently :-)
> 
> 
> The documentation may not be completely clear, but what it is arguing is
> this:
> 
> If you are making repeated bisect calls, then using a key function is
> inefficient because the key gets lost after each call and has to be
> recalculated over and over again. A concrete example:
> 
> data = [i/100 for i in range(1, 7000001, 7)]
> data.sort(key=str)
> 
> for x in many_x_values:
>     bisect.insort(data, x, key=str)  # Pretend this works.
> 
> 
> The first time you call insort, the key function (str in this case) will be
> called O(log N) times. The second time you call insort, the key function
> must be called again, even for the same data points, because the keys
> str(x) are thrown away and lost after each call to insort.
> 
> After M calls to insort, there will have been on average M*(log(N)+1) calls
> to the key function. (The +1 is because the x gets str'ed as well, and
> there are M loops.)
> 
> 
> 
> As an alternative, suppose we do this:
> 
> data = [i/100 for i in range(1, 7000001, 7)]
> data.sort(key=str)
> keyed_data = [str(x) for x in data]
> 
> for x in many_x_values:
>     key = str(x)
>     pos = bisect.bisect(keyed_data, key)
>     keyed_data.insert(pos, key)
>     data.insert(pos, x)
> 
> 
> This costs N calls to str in preparing the keyed_data list, plus M calls to
> str in the loop. The documentation suggests that we consider this will be
> cheaper in the long run, in other words:
> 
> N + M < M*(log(N) + 1)
> 
> N + M < M + M*log(N)
> 
> 
> That's true whenever N < M*log(N). 
> 
> The documentation assumes we should care about large N and large M. As they
> say, "everything is fast for small N": if you are doing three bisections on
> a list with ten items, then *who cares* how you do it, it's going to be
> quite fast. So let's say you have a list of 10000 items, that is N = 10000.
> Then it is faster to build a second keyed_list when:
> 
> 10000 < M*log(10000) = M*13  # approximately
> 
> i.e. M > 770 give or take. That's a fairly small number, and it's only about
> 8% of the size of N.
> 
> So the general perspective is:
> 
> If you have a list of N items, and you call bisect more than about (10% of
> N) times, it will likely be faster to pre-prepare a keyed list than call a
> key function repeatedly for each call to bisect.
> 
> If you call bisect less than (10% of N) times, then it doesn't matter,
> because that's comparatively a small number and will be fast enough either
> way.
> 
> 
> Is the documentation correct? I don't know. I don't have an opinion on this.
> I recommend that you fork the bisect module, call it mybisect, add a key
> function parameter, and compare the speed of the two versions, with or
> without a key function. Remember:
> 
> - Recent versions of bisect have a C accelerated version. Make sure you
> disable that, so you are comparing the speed of Python versus Python not
> Python versus C.
> 
> - Nobody will care about small lists. I suggest that you start with at least
> ten thousand items, and move up to ten billion or so. For each N in (10**4,
> 10**5, ... 10**10) find out how many calls to insort or bisect it takes
> before the key function version becomes slower. Or possibly faster.
> 
> - Is there a difference between fast key functions like str, and slow ones
> that have to do a lot of processing?
> 
> 
> 
> -- 
> Steven
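
P.S. To make the comparison you suggest concrete, here is a rough benchmark sketch, pure Python on both sides so the C-accelerated bisect doesn't skew it (illustrative only; the helper names are mine, the key-aware search repeats the bisect_right_key idea from above so the snippet stands alone, and the timings will of course depend on N, M and the machine):

import random
import timeit

def bisect_right_key(a, k, key):
    # Pure-Python bisect_right comparing k against key(a[i]) (sketch only).
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if k < key(a[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_right_plain(a, k):
    # Same search, but over a list that already holds the keys.
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if k < a[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo

def with_key(data, xs):
    # One list, one insert per element; str() recomputed on every probe.
    data = sorted(data, key=str)
    for x in xs:
        pos = bisect_right_key(data, str(x), key=str)
        data.insert(pos, x)

def with_precomputed_keys(data, xs):
    # The documented approach: an extra list of keys, two inserts per element.
    data = sorted(data, key=str)
    keyed = [str(v) for v in data]
    for x in xs:
        k = str(x)
        pos = bisect_right_plain(keyed, k)
        keyed.insert(pos, k)
        data.insert(pos, x)

N, M = 10**5, 10**3    # vary these to look for the crossover point
data = [random.random() for _ in range(N)]
xs = [random.random() for _ in range(M)]
print("key function:    ", timeit.timeit(lambda: with_key(data, xs), number=3))
print("precomputed keys:", timeit.timeit(lambda: with_precomputed_keys(data, xs), number=3))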



