testing for uniqueness in a large list

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed Oct 20 15:19:34 EDT 2004


"Paul McGuire" <ptmcg at austin.rr._bogus_.com> wrote in message
news:sRydd.12955$sO5.12002 at fe1.texas.rr.com...
> "bearophile" <bearophileHUGS at lycos.com> wrote in message
> news:5c882bb5.0410200830.471a5ea9 at posting.google.com...
> > Alex Martelli's solution is *very* good, but there is a sampling
> > problem: putting a simple printing inside his program:
> >
> > if not (len(results) % 5000):
> >     print len(results)
> >
> > You can see that it slows down a lot when the dictionary contains
> > about 100000-120000 different sequences, because there are many
> > collisions, and it keeps slowing down. A small speed-up of this code
> > probably won't help; a different algorithm may be needed.
> > I don't know... direct generation of the sequences doesn't seem possible.
> > Maybe a partially direct generation could work.
> >
> > Hugs,
> > bearophile
>
> Is part of this slowdown the fact that the loop will exit only when rlen
> reaches 200,000, when you have just shown that it can never be greater
> than 125,970?
>
> Perhaps some heuristic would opt for using sets instead of random.sample
> if rlen approaches some threshold value of
> len(seq)!/(picklen! * (len(seq)-picklen)!), perhaps .8 or so.  There
> should be a crossover point at which it is cheaper to build a set of all
> 125,970 combinations, and then select 100,000 of them using set.pop()
> (which selects an arbitrary element from the set, and removes it).
>
> -- Paul
>
>
... and of course, raising a MathError if rlen >
len(seq)!/(picklen! * (len(seq)-picklen)!)
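
The crossover heuristic might be sketched like this (a hypothetical
illustration only: the name unique_samples and the 0.8 threshold are my
inventions, math.comb computes the binomial coefficient from the formula
above, and a ValueError stands in for the suggested MathError, which is
not a built-in exception):

```python
import math
import random
from itertools import combinations

def unique_samples(seq, picklen, rlen, threshold=0.8):
    """Return rlen distinct picklen-combinations drawn from seq."""
    # Total number of combinations: len(seq)!/(picklen! * (len(seq)-picklen)!)
    total = math.comb(len(seq), picklen)
    if rlen > total:
        raise ValueError("rlen exceeds the number of distinct combinations")
    if rlen > threshold * total:
        # Dense case: enumerating everything once is cheaper than
        # repeatedly re-drawing samples that collide with earlier ones.
        pool = set(combinations(seq, picklen))
        # set.pop() removes and returns an arbitrary element.
        return [pool.pop() for _ in range(rlen)]
    # Sparse case: rejection-sample until we have rlen distinct results.
    # Sorting gives each combination a canonical form so collisions
    # are detected regardless of the order random.sample returned.
    results = set()
    while len(results) < rlen:
        results.add(tuple(sorted(random.sample(seq, picklen))))
    return list(results)
```

For the numbers in this thread (a 20-element sequence, picks of 8),
math.comb(20, 8) gives the 125,970 figure quoted above, so a request for
100,000 samples would take the dense branch.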

-- Paul





More information about the Python-list mailing list