[Tutor] Binary search question

Fri Apr 23 18:40:53 CEST 2010

On Fri, Apr 23, 2010 at 5:29 PM, Emile van Sebille <emile at fenx.com> wrote:
>
>
> It's expensive enough that for a list this size I'd convert it to a dict and
> use in on that.  eg,
>
> a = range(100000)
> d = dict(zip(a,a))
>
> 55555 in d
>

you will want to nuance that statement just a little. If you're going
to perform just a few key lookups, the process of creating the dict
will negate any performance advantage. With many key lookups it will
be a win, of course, because O(1) lookup saves you a lot over O(n).

Here's some timings of different methods:

$ python prof.py
starting timing...
timing list: [7.5448801517486572, 6.4503600597381592, 6.1473228931427002]
timing bisect: [0.012073993682861328, 0.011871099472045898,
0.011121988296508789]
timing frozenset (one lookup): [9.3203139305114746,
15.974851131439209, 10.923642873764038]
timing dict (one lookup)": [32.903657913208008, 38.234655857086182,
42.381407976150513]
timing frozenset (many lookups): [0.0022511482238769531,
0.0021729469299316406, 0.0021619796752929688]
timing dict (many lookups): [0.0048539638519287109,
0.0043039321899414062, 0.0047240257263183594]

each one times 10.000 lookups of the number 55555 into a sorted list
(range(10000)), using different methods:

* list: use the 'in' operator on the list
* bisect: use the bisection algorithm from the std lib.
* frozenset: create a frozenset from the list and use that for lookup.
* dict: create a dict from the list and use that for lookup.

The frozenset and dict methods are done in two differen ways: creating
it once and using it for all lookups, and creating a new one for every
lookup (simulating the case where you only need to do a single lookup
into the list)

conclusions:
* if you only need to do one lookup into the list, use bisect if you
can. It's the better algorithm. Your second-best option is the 'in'
operator. creating a dict or frozenset is too expensive.
* when you're going to do many lookups on a list, creating a frozenset
from it is the fastest option. It's cheaper to create than a
dictionary, and lookups are O(1).

caveats:
* timings apply to a sorted list. if your list is unsorted and you
only need a single lookup, the simple lookup may be better than a
sort/bisect. for multiple lookups, frozenset will likely still be the
fastest method.
* the number chosen to look up is not present in the list,
guarantueeing worst case performace for list.__contains__. Timings
will be different for a number present in the list (when looking up
the first element, performace should be very good indeed). However,
you should always account for the worst case IMHO

for the curious, the timing script I used is up here:
http://gist.github.com/376767

HTH,
Hugo