Binary search tree

Scott SA pydev at rscorp.ab.ca
Mon Nov 12 04:17:28 EST 2007


On 11/12/07, Michel Albert (exhuma at gmail.com) wrote:

>On Nov 9, 11:45 pm, Bruno Desthuilliers
><bdesth.quelquech... at free.quelquepart.fr> wrote:
>> maxim.no... at gmail.com wrote:
>>
>> > Hi,
>>
>> > I have to get a list of URLs one by one and find the URLs that occur
>> > more than once (a URL can't appear more than twice).
>>
>> > I thought to put them into binary search tree, this way they'll be
>> > sorted and I'll be able to check if the URL already exist.
>>
>> What about a set?
>>
>> s = set()
>> for url in urls:
>>     if url in s:
>>         print "already have ", url
>>     else:
>>         s.add(url)
>
>Interesting. For this case I've usually used dicts. As in:
>
>d = {}
>for url in urls:
>   if url in d:
>      print "already have ", url
>   else:
>      d[url] = 1
>
>Now, I can see that this method has some superfluous data (the `1`
>that is assigned to the dict). So I suppose this is less memory
>efficient. But is this slower, then? Both implementations use hashes
>of the URL to access the data. Just asking out of curiosity ;)
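
Both 'set' and 'dict' membership tests go through the same hashing machinery as far as I know, so I'd expect them to be close. Here's a quick-and-dirty timing sketch (the names and the fake URLs are my own) if you want to check for yourself:

    import timeit

    # build a set and a dict holding the same 1000 fake URLs
    setup = ("urls = ['url%d' % i for i in range(1000)];"
             "s = set(urls); d = dict.fromkeys(urls)")

    # time one membership test against each container
    print timeit.Timer("'url500' in s", setup).timeit()
    print timeit.Timer("'url500' in d", setup).timeit()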


I'm pretty new at Python, but I found this recipe in a cookbook (thanks, O'Reilly, for a useful book). I also found it online later, when the cookbook wasn't handy:

    <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66516>
    <http://mail.python.org/pipermail/tutor/2005-March/036966.html>
    ... google for more

It uses the 'setdefault' method and works _really_ well.

    <http://docs.python.org/lib/typesmapping.html>

Basically it looks like this. Given:

    urls = ['www.somewhere.com', 'www.somewherelse.com', 'www.somewherelse.com']

then:

    unique_urls = {}
    for url in urls:
        unique_urls.setdefault(url)

... creates a dict of None's*:

    {'www.somewhere.com': None, 'www.somewherelse.com': None}
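
As an aside, dict.fromkeys() should build the same None-valued dict in one call, if you'd rather skip the loop:

    # fromkeys maps every element of urls to None by default
    unique_urls = dict.fromkeys(urls)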


and "harvesting" is trivial:
    url_keys    = unique_urls.keys()


If you would find an index and count useful, you could do something like this (starting from a fresh dict, since appending to the None values left over from above would fail):

    unique_urls = {}
    for idx in range(len(urls)):
        unique_urls.setdefault(urls[idx], []).append(idx)

... which gives:

    {'www.somewhere.com': [0], 'www.somewherelse.com': [1, 2]}
    
later, the URLs, their index values, and counts can be recovered:

    url_keys    = unique_urls.keys()
    for url_key in url_keys:
        url_idxs    = unique_urls[url_key]
        url_hits    = len(url_idxs)
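
For the original question (URLs that turn up more than once), pulling just the duplicates out of that dict is a one-liner; something like this, reusing the same names:

    # keep only the URLs whose index list has more than one entry
    duplicate_urls = [url for url in unique_urls if len(unique_urls[url]) > 1]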

Depending upon the data you may have associated with your URLs, or whatever else you're working with, this is a very sweet 'shortcut'.
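
If you're on Python 2.5 or newer, I believe collections.defaultdict gets you the same result without the repeated setdefault calls, and enumerate() saves the range(len(...)) dance. A rough equivalent of the indexing loop above:

    from collections import defaultdict

    # defaultdict(list) supplies a fresh empty list on first access to a key
    unique_urls = defaultdict(list)
    for idx, url in enumerate(urls):
        unique_urls[url].append(idx)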

I hope this helps,

Scott

* To be truly Python-esque, 'habits' would be employed and the value would have been 'Nun', but I guess the namesake can only be carried so far (^8


