testing for uniqueness in a large list

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed Oct 20 15:17:44 EDT 2004


"bearophile" <bearophileHUGS at lycos.com> wrote in message
news:5c882bb5.0410200830.471a5ea9 at posting.google.com...
> Alex Martelli's solution is *very* good, but there is a sampling
> problem: putting a simple printing inside his program:
>
> if not (len(results) % 5000):
>     print len(results)
>
> You can see that it slows down a lot once the dictionary contains
> about 100000-120000 different sequences, because there are many
> collisions, and it keeps slowing down. A small speedup of this code
> probably cannot help; a different algorithm may be needed.
> I don't know... direct generation of the sequences doesn't seem possible.
> Maybe a partially direct generation can be okay.
>
> Hugs,
> bearophile

Is part of this slowdown the fact that the loop will exit only when rlen
reaches 200,000, when you have just shown that it can never be greater than
125,970?
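For reference, 125,970 appears to be the binomial coefficient C(20, 8), i.e. the number of distinct 8-element combinations drawn from a 20-element sequence (the n=20, k=8 parameters are an inference from the figure quoted above, not stated explicitly here). A quick check:

```python
from math import comb  # Python 3.8+

# 125,970 as the count of distinct 8-combinations of 20 items;
# comb(n, k) computes n! / (k! * (n - k)!)
n, k = 20, 8
print(comb(n, k))  # 125970
```

So a loop that only terminates after collecting 200,000 distinct combinations can never finish.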

Perhaps some heuristic would opt for using sets instead of random.sample if
rlen exceeds some threshold fraction, perhaps .8 or so, of the total
len(seq)! / (picklen! * (len(seq)-picklen)!) combinations.  There should be
a crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects
an arbitrary element from the set, and removes it).
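A minimal sketch of that crossover heuristic, using modern Python (itertools.combinations and math.comb, neither of which the 2004 thread had available); the names `seq`, `picklen`, and `rlen` follow the thread, and the 0.8 threshold is just the guess above. Note this sketch samples the enumerated combinations with random.sample rather than set.pop, since pop's "arbitrary" choice is not uniformly random:

```python
import itertools
import math
import random

def sample_unique_combinations(seq, picklen, rlen, threshold=0.8):
    """Return rlen distinct picklen-combinations of seq.

    If rlen is a large fraction of all C(len(seq), picklen)
    combinations, enumerate them all and sample directly;
    otherwise use rejection sampling into a set.
    """
    total = math.comb(len(seq), picklen)
    if rlen > total:
        raise ValueError("only %d distinct combinations exist" % total)
    if rlen >= threshold * total:
        # Near the crossover point it is cheaper to build every
        # combination once and pick rlen of them without replacement.
        all_combos = list(itertools.combinations(seq, picklen))
        return random.sample(all_combos, rlen)
    # Rejection sampling: fast while collisions are rare, but it
    # slows down badly as the set fills up (the behavior bearophile
    # observed around 100,000-120,000 entries).
    results = set()
    while len(results) < rlen:
        # Sort so each combination has one canonical form.
        results.add(tuple(sorted(random.sample(seq, picklen))))
    return list(results)
```

For example, `sample_unique_combinations(list(range(20)), 8, 100000)` draws 100,000 of the 125,970 possible 8-combinations; a ratio that close to 1 is exactly where full enumeration starts to win.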

-- Paul
