List Count

Oscar Benjamin oscar.j.benjamin at gmail.com
Tue Apr 23 13:45:41 EDT 2013


On 23 April 2013 17:57, Blind Anagram <blindanagram at nowhere.org> wrote:
> On 23/04/2013 15:49, Steven D'Aprano wrote:
>> On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:
>>
>>> I did a lot of work comparing the overall performance of the sieve when
>>> using either lists or arrays and I found that lists were a lot faster
>>> for the majority use case when the sieve is not large.
>>
>> And when the sieve is large?
>
> I don't know, but since the majority use case is when the sieve is small,
> it makes sense to choose a list.

That's an odd comment given what you said at the start of this thread:

Blind Anagram wrote:
> I would be grateful for any advice people can offer on the fastest way
> to count items in a sub-sequence of a large list.
>
> I have a list of boolean values that can contain many hundreds of
> millions of elements for which I want to count the number of True values
> in a sub-sequence, one from the start up to some value (say hi).
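
For anyone following along, the operation under discussion boils down to
something like this (the sizes and names here are only illustrative):

    sieve = [True] * (10 ** 6)   # stand-in for the real sieve of bools
    hi = 500000
    # The slice builds a temporary copy of the first hi elements before
    # counting them; that copy is the overhead being discussed.
    count = sieve[:hi].count(True)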


>> I don't actually believe that the bottleneck is the cost of taking a list
>> slice. That's pretty fast, even for huge lists, and all efforts to skip
>> making a copy by using itertools.islice actually ended up slower. But
>> suppose it is the bottleneck. Then *sooner or later* arrays will win over
>> lists, simply because they're smaller.
>
> Maybe you have not noticed that, when I am discussing a huge sieve, I am
> simply pushing a sieve designed primarily for small sieve lengths to
> the absolute limit.  This is most definitely a minority use case.
>
> In pushing the size of the sieve upwards, it is the slice operation that
> is the first thing that 'breaks'.  This is because the slice can be
> almost as big as the primary array so the OS gets driven into memory
> allocation problems for a sieve that is about half the length it would
> otherwise be. It still works but the cost of the slice once this point
> is reached rises from about 20% to over 600% because of all the paging
> going on.

You keep mentioning that you want it to work with a large sieve. I
would much rather compute the same quantities with a small sieve if
possible. If you were using the Lehmer/Meissel algorithm you would be
able to compute the same quantity (i.e. pi(1e9)) using a much smaller
sieve of about 30k items instead of 1e9. That would fit *very* comfortably
in memory and you wouldn't even need to slice the list. Or, to put it
another way, you could compute pi(~1e18) using your current sieve
without slicing or paging. If you want to lift the limit on computing
pi(x), this is clearly the way to go.
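
To make that concrete, here is a rough, untuned sketch of Legendre's
formula, the simplest member of the Meissel/Lehmer family:
pi(x) = phi(x, a) + a - 1 with a = pi(sqrt(x)), where phi(x, a) counts
the integers in [1, x] with no prime factor among the first a primes. It
only ever sieves up to sqrt(x); the names are illustrative and it assumes
Python 3.2+ for functools.lru_cache:

    from functools import lru_cache

    def small_primes(limit):
        # Plain sieve of Eratosthenes up to `limit`.
        sieve = [True] * (limit + 1)
        sieve[:2] = [False, False]
        for i in range(2, int(limit ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i::i] = [False] * len(sieve[i * i::i])
        return [i for i, is_prime in enumerate(sieve) if is_prime]

    def prime_pi(x):
        # Legendre's formula: pi(x) = phi(x, a) + a - 1, a = pi(sqrt(x)).
        if x < 2:
            return 0
        primes = small_primes(int(x ** 0.5))
        a = len(primes)

        @lru_cache(maxsize=None)
        def phi(n, k):
            # Count the integers in 1..n with no factor among the
            # first k primes.
            if n == 0 or k == 0:
                return n
            return phi(n, k - 1) - phi(n // primes[k - 1], k - 1)

        return phi(x, a) + a - 1

    print(prime_pi(10 ** 6))   # -> 78498

The real Meissel/Lehmer refinements cut down the number of phi() terms
dramatically, but even this naive version never needs a sieve beyond
sqrt(x).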

>
> A list.count(value, limit) function, were it available, would hence allow
> the sieve length to be up to twice as large before running into problems and
> would also cut the 20% slice cost I am seeing on smaller sieve lengths.
>
> So, all I was doing in asking for advice was to check whether there is
> an easy way of avoiding the slice copy, not because this is critical,
> but rather because it is a pity to limit the performance because Python
> forces a (strictly unnecessary) copy in order to perform a count within
> a part of a list.
>
> In other words, the lack of a list.count(value, limit) function makes
> Python less effective than it would otherwise be.  I haven't looked at
> Python's C code base but I still wonder if there is a good reason for NOT
> providing this?

If you feel that this is a good suggestion for an improvement to
Python, consider posting it on python-ideas. I wasn't aware of the
equivalent functionality on strings, but I see that the tuple.count()
function is the same as list.count().
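
As an aside, if the sieve were held as a bytearray of 0/1 values rather
than a list of bools, the bytes form of count() already takes start and
end positions, so the partial count would need no slice copy at all. A
rough sketch (this assumes Python 3.3+, where count() also accepts an
integer byte value, and the data here is only a stand-in):

    sieve = bytearray([1] * 1000)   # stand-in sieve: 1 = True, 0 = False
    sieve[0:2] = b'\x00\x00'        # toy marking, not a real sieve pass
    hi = 500
    count = sieve.count(1, 0, hi)   # counts within [0, hi) without copying

Whether that wins overall would need measuring, since marking composites
in a bytearray during sieving has its own costs.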


Oscar


