[Tutor] Lists of duplicates

Steven D'Aprano steve at pearwood.info
Thu Mar 9 08:54:22 EST 2017


On Thu, Mar 09, 2017 at 01:26:26AM +0530, Sri Kavi wrote:

> I wrote and submitted the following function:
> 
> def sublist_duplicates(lst):
>     sublists = []
>     for item in set(lst):
>         sublists.append([item] * lst.count(item))
>     return sublists
> 
> lst_of_duplicates = [1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']
> 
> print sublist_duplicates(lst_of_duplicates)
> 
> A participant told me that that'll grind into a near-halt if there are many
> different elements, for example range(1_000_000) (1 million unique values,
> so it checks the whole list (length 1 million) for the count, a million
> times, a total of 1000 billion operations…

Correct. Each call to lst.count(item) has to walk the entire list, 
counting how many elements equal item. It does that in C code, so each 
walk is fast, but still, if there are a lot of elements to walk, it 
takes some time. Then you do it again, and again, and again...

For a list of N distinct items, your sublist_duplicates function walks 
the entire list N times, once per call to count, for roughly N*N 
operations in total. When N is large, N**2 is even larger.
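A single pass over the list avoids the repeated walks entirely. As a 
minimal sketch (the function name is mine), counting into a plain dict 
does by hand the same job that collections.Counter automates:

```python
def sublist_duplicates_one_pass(lst):
    # One walk over the list: each item bumps its own tally,
    # so the total work is proportional to N, not N*N.
    counts = {}
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
    return [[item] * n for item, n in counts.items()]

print(sublist_duplicates_one_pass([1, 2, 1, 's', 2, 1]))
```

(Dict ordering varies between Python versions, so the sublists may come 
out in a different order than the set-based version produced, but the 
groups themselves are the same.)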


> I thought he was right. I wrote another version of the function, this time
> using collections.Counter.
> 
> from collections import Counter
> 
> def sublist_duplicates(lst):
>     counts = Counter(lst)
>     sublists = [[k] * counts[k] for k in counts]
>     return sublists
> 
> _lst = list(range(1000000))
[...]


> I found that the first version works better with a small list,

Yes, but for a small list the total time is likely to be so small that 
it doesn't really matter.
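That intuition is easy to check with the timeit module. A rough sketch, 
with sizes and function names of my own choosing, just to make the 
crossover visible:

```python
import timeit
from collections import Counter

def count_version(lst):
    # Original approach: one full walk of lst per distinct item.
    return [[item] * lst.count(item) for item in set(lst)]

def counter_version(lst):
    # Counter builds all the tallies in a single walk.
    counts = Counter(lst)
    return [[k] * counts[k] for k in counts]

small = [1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']
big = list(range(2000))  # every item unique: worst case for count_version

for label, data in [('small', small), ('big', big)]:
    for fn in (count_version, counter_version):
        t = timeit.timeit(lambda: fn(data), number=10)
        print('%-5s %-16s %.4fs' % (label, fn.__name__, t))
```

On the small list the fixed overhead of building a Counter can make it 
the slower of the two; on the big list of unique items the quadratic 
version falls far behind.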

> but the
> second version outperforms the first one when it’s given a considerably big
> list of data.
> 
> 
> I’m asking for your comments on these two functions. I’m not even sure if
> this is how it should be coded.

I agree with your analysis.
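For what it's worth, sorting offers a third route: itertools.groupby 
slices a sorted list into runs of equal items. A sketch (the key=repr 
trick is only there so your mixed int/str list can be sorted under 
Python 3, where comparing 1 < 's' raises TypeError):

```python
from itertools import groupby

def sublist_duplicates_sorted(lst):
    # Sorting puts equal items next to each other; groupby then
    # collects each run. O(N log N) overall -- more than Counter's
    # O(N), but far better than the N*N of repeated count() calls.
    # Equal items share a repr, so sorting by repr still lands them
    # adjacent and they group together.
    return [list(group) for _, group in groupby(sorted(lst, key=repr))]

print(sublist_duplicates_sorted([1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']))
```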



-- 
Steve

