List Count

Blind Anagram blindanagram at nowhere.org
Tue Apr 23 07:45:55 EDT 2013


On 23/04/2013 12:08, Oscar Benjamin wrote:
> On 23 April 2013 08:05, Blind Anagram <blindanagram at nowhere.org> wrote:
>> On 23/04/2013 00:01, Steven D'Aprano wrote:
>>> On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:
>>>
>>>> But when using a sub-sequence, I do suffer a significant reduction in
>>>> speed for a count when compared with count on the full list.  When the
>>>> list is small enough not to cause memory allocation issues this is about
>>>> 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
>>>> memory allocation becomes an issue and the cost on my system rises to
>>>> over 600%.
> [snip]
>>>
>>> Another solution might be to use arrays rather than lists. Since your
>>> sieve list is homogeneous, you could possibly use an array of 1 or 0
>>> bytes rather than a list of True or False bools. That would reduce the
>>> memory overhead by a factor of four, and similarly reduce the overhead of
>>> any copying:
>>
>> I did a lot of work comparing the overall performance of the sieve when
>> using either lists or arrays and I found that lists were a lot faster
>> for the majority use case when the sieve is not large.
> 
> Okay, now I understand. I thought you were trying to optimise for
> large lists, but in fact you have already optimised for small lists.
> As a result you have made algorithmic choices that don't scale very
> well. Since you're still concerned about performance on small lists
> you don't want to rethink those choices. Instead you want a
> micro-optimisation that would compensate for them.
> 
> Elsewhere you said:
> 
>> I would not dream of doing this job by copying a slice in any other
>> language that I have used so I was simply asking for advice to discover
>> whether this copy could be avoided whilst staying with the simple sieve
>> design.
> 
> So you already knew that there would be problems with this method, but
> you've chosen it anyway since it turned out to be fastest for small
> lists. You could always just do a different thing when the list is
> large:

Your analysis of my rationale is sound, except that I only discovered
that I had a problem counting within a subset of a list when I actually
tried this on a huge sieve.

It was only then that I discovered that there was no way of setting the
upper limit in list.count(x) and hence that I would have to create a
slice or find an alternative approach.

I then wondered why count() for lists takes no start/end limits whereas
count() for other objects (e.g. strings) does.  I also wondered whether
there was an easy way of avoiding the slice - not that this is critical,
but it is just nice to have a sieve that still works for huge values,
albeit inefficiently.
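
To illustrate the asymmetry (the 'sieve' name below is just made up for
the example): str.count() accepts optional start/end positions, while
list.count() takes only the value to look for, so counting within a
prefix of a list means either slicing (which copies) or iterating:

    text = "ababab"
    text.count("a", 0, 3)        # -> 2; str.count() takes start/end

    sieve = [True, False, True, True, False]
    # sieve.count(True, 0, 3)    # TypeError: list.count() takes one argument
    sieve[:3].count(True)        # -> 2, but the slice copies n items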

> def pi(self, n):
>     if n < 1000000:
>         # small n: the slice copy is cheap and the sum runs quickly
>         return sum(self.indicator[:n])
>     else:
>         # large n: avoid the big copy by summing over an iterator
>         return sum(itertools.islice(self.indicator, n))
>

I have looked at itertools.islice as a complete replacement for count(),
but on average it was a lot slower.  However, I have not tried a hybrid
strategy such as you suggest - I'll take a look at this.
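
In case it is useful to anyone else, here is a rough sketch of the kind
of hybrid I understand you to mean (the function name and the cutoff are
made up, and the cutoff would need tuning by measurement): below a
threshold, slice and count; above it, sum over itertools.islice to avoid
the large copy:

    import itertools

    SLICE_CUTOFF = 1000000        # hypothetical threshold, tune by timing

    def count_true(sieve, n):
        # Count the True entries in sieve[:n] without always copying.
        if n <= SLICE_CUTOFF:
            # small prefix: the slice copy is cheap and list.count runs in C
            return sieve[:n].count(True)
        # large prefix: avoid the copy; True sums as 1, False as 0
        return sum(itertools.islice(sieve, n))

e.g. count_true(sieve, 10**8).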

My thanks to you and others for the advice that has been offered.

   Brian



