Binary search tree

Scott Sandeman-Allen scott at rscorp.ab.ca
Tue Nov 13 04:54:31 EST 2007


On 11/13/07, Terry Reedy (tjreedy at udel.edu) wrote:


>"Scott SA" <pydev at rscorp.ab.ca> wrote in message 
>news:r02010500-1049-7D029806915411DC919E001124DEBE0E@[192.168.69.99]...
>| On 11/12/07, Scott SA (pydev at rscorp.ab.ca) wrote:
>| I decided to test the speeds of the four methods:
>|
>|    set_example
>|        s = set()
>|        for url in urls:
>|            if not url in s:
>|                s.add(url)
>
>How is that different in result from

As mentioned in my earlier post, I'm fairly new to Python, so many options are unknown to me (even then, there are often new ideas or approaches even for seasoned coders; I'm not looking forward to the day I stop learning ;-).

>set_example2
>   s = set(urls)

Well, that's pretty sweet. Thanks for the heads-up on that!
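For anyone else following along, a tiny sketch of what that one-liner does (the sample URLs are just made up):

    urls = ["a.com", "b.com", "a.com", "c.com", "b.com"]
    # set() collapses duplicates in a single pass, no explicit loop needed
    s = set(urls)
    print(len(s))   # 3 unique URLs
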

>which would be much faster, I would expect, than

Please see below; it is the 'set2' result.

>|    dict_example
>|        d = {}
>|        for url in urls:
>|            if url in d:
>|                d[url] = 1

    Note:
    There was an error in this; it was missing a 'not',
    so it should have been:

       dict_example
           d = {}
           for url in urls:
               if url not in d:
                   d[url] = 1
    
>|    Starting tests with 500000 simulated URLs run 3 times.
>|    ------------------------------------------------------------
>|    'set' example
>|                        run time: 0.5505 seconds
>|    'dict' example
>|                        run time: 0.2521 seconds

    Starting tests with 500000 simulated URLs run 3 times.
    ------------------------------------------------------------
    'set' example
                        run time: 0.8626 seconds
                        5000 unique URLs
    'set2' example
                        run time: 0.4896 seconds
                        5000 unique URLs
    'dict' example
                        run time: 0.4296 seconds
                        5000 unique URLs
    'setdefault' example
                        run time: 0.9723 seconds
                        5000 unique URLs
    ------------------------------------------------------------

Interestingly, the 'dict' method is still the champ for the simulations I created. Of course, a different duplication ratio in the source list, a different list size, and no doubt other environmental circumstances would affect the results.


Since my earlier posts, I found and re-read the original post:

    >I have to get list of URLs one by one and to find the
    >URLs that I have more than one time(can't be more than
    >twice).

My re-interpretation of his needs was that there could be "no more than two" of any URL. To me this means he required not only the unique 'set' but also needed to _count_ URL hits and subsequently process any overages.

From my testing, the following _seems_ to be the best performer:

    def dictcount_example(urls):
        d = {}
        for url in urls:
            if url in d:
                d[url] += 1
                if d[url] > 2:
                    proc_overage(url) # handle overage
            else:
                d[url] = 1
        return d
    
Without the 'if' statement (I added it for functional completeness), it performed as follows:
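As an aside, the same counting pattern can also be written with collections.defaultdict (new in Python 2.5); I haven't timed it against the plain-dict version above, so treat this purely as a sketch (the function name and the post-loop overage scan are my own):

    from collections import defaultdict

    def dictcount_defaultdict(urls):
        # defaultdict(int) starts every missing key at 0, so the
        # increment needs no membership test
        d = defaultdict(int)
        for url in urls:
            d[url] += 1
        # flag anything seen more than twice after the single pass
        overages = [u for u, n in d.items() if n > 2]
        return d, overages

    counts, over = dictcount_defaultdict(["a", "a", "a", "b"])
    # counts["a"] == 3, over == ["a"]
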

    Starting tests with 500000 simulated URLs run 3 times.
    ------------------------------------------------------------
    'dictcount' example
                        run time: 1.1556 seconds
                        5000 unique URLs
    ------------------------------------------------------------


... Curiosity got me going, so I reduced the redundancy from 100 duplicates per URL to 5 and increased the unique URLs to 100,000. The results are:

    Starting tests with 500000 simulated URLs run 3 times.
    ------------------------------------------------------------
    'set' example
                        run time: 1.1986 seconds
                        100000 unique URLs
    'set2' example
                        run time: 0.5986 seconds
                        100000 unique URLs
    'dict' example
                        run time: 0.8676 seconds
                        100000 unique URLs
    'dictcount' example
                        run time: 1.3631 seconds
                        100000 unique URLs
    ------------------------------------------------------------

So, the set() functionality in Python appears better with smaller amounts of redundancy. Still, evaluating half a million URLs in sub-second timeframes is, in my opinion, a respectable pace, especially when one considers the call is _one_ line!

Scaled to 2,000,000 URLs total and 1,000,000 unique URLs, the timings are interesting:

    Starting tests with 2000000 simulated URLs run 3 times.
    ------------------------------------------------------------
    'set' example
                        run time: 7.9292 seconds
                        1000000 unique URLs
    'set2' example
                        run time: 4.6351 seconds
                        1000000 unique URLs
    'dict' example
                        run time: 5.0060 seconds
                        1000000 unique URLs
    'dictcount' example
                        run time: 6.2595 seconds
                        1000000 unique URLs
    ------------------------------------------------------------

For simply extracting a unique list, the set(urls) and dict examples are pretty close and dramatically out-pace my original post describing dictionary 'setdefault' functionality.
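My timing harness isn't shown here, but a minimal sketch of the same comparison using the standard timeit module would look something like the following (the simulated-URL generator is just a stand-in for my test data, with 100x duplication like the first run):

    import timeit

    # 500,000 simulated URLs, 5,000 unique (100 duplicates each)
    setup = "urls = [str(i % 5000) for i in range(500000)]"

    # set(urls) in one call vs. the explicit dict loop
    t_set2 = timeit.timeit("set(urls)", setup=setup, number=3)
    t_dict = timeit.timeit(
        "d = {}\nfor url in urls:\n    if url not in d:\n        d[url] = 1",
        setup=setup, number=3)

    print(t_set2, t_dict)

Absolute numbers will of course vary by machine; it's the relative ordering that is interesting.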

Thanks again for the heads-up about 'set()', that will come in handy (esp. for simplicity of code!).

Scott


