finding items that occur more than once in a list

Fri Mar 21 16:39:09 EDT 2008

On Mar 22, 1:11 am, castiro... at gmail.com wrote:
> On Mar 21, 4:17 am, Bryan Olson <fakeaddr... at nowhere.org> wrote:
>
>
>
> > Arnaud Delobelle wrote:
> > > Bryan Olson wrote:
> > [...]
> > >> Arnaud Delobelle offered a good Wikipedia link, and for more background
> > >> look up "amortized analysis.
>
> > > Hrvoje Niksic provided the link :).
>
> > Oops, careless thread-following. Hrvoje Niksic it was.
>
> > > I still think two unrelated
> > > things are being compared in this thread when people say that method A
> > > (using dictionaries / sets) is O(n) and method B (sorting the list)
> > > O(nlogn).
>
> > > Isn't it the case that:
>
> > >          | Worst case | Average case
> > > ---------|------------|-------------
> > > Method A | O(n^2)     | O(n)
> > > Method B | O(nlogn)   | O(nlogn)
>
> > > Which method is the best would then be determined by the distribution
> > > of the hash values of your data, and whether you want to have a
> > > guarantee the method will have a less-than-quadratic behaviour.
>
> > If we exclude the case where an adversary is choosing the keys, the
> > chance of a seriously degenerate case in the hashing method is so
> > remote that we do should not worry about it. Those who insist on
> > ranking by worst-case behavior have probably abandoned Python long
> > ago, as it uses those hashed 'dict' things everywhere. Of course
> > they've also abandoned OS's with demand-paged virtual memory and
> > such.
>
> > --
> > --Bryan
>
> A collision sequence is not so rare.
>
> >>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
>
> [1, 1, 1, 1, 1, 1, 1, 1]

Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."