[Tutor] Python Extensions in C

Thu May 26 23:28:47 CEST 2011

James Reynolds, 26.05.2011 21:34:
> On Thu, May 26, 2011 at 3:07 PM, Stefan Behnel wrote:
>> Stefan Behnel, 26.05.2011 18:10:
>>   James Reynolds, 26.05.2011 17:22:
>>>
>>>> As an intellectual exercise, I wanted to try my hand at writing some
>>>> extensions in C.
>>>
>>> This is fine for en exercise, and I hope you had fun doing this.
>>>
>>> However, for real code, I suggest you use Cython instead. Your module
>>> would have been substantially simpler and likely also faster.
>>>
>>> http://cython.org
>>
>> Oh, and one more thing: it makes it easier to write safe, portable and
>> versatile code. As others have pointed out, your code has unnecessary bugs.
>> It also doesn't compile in Python 3 and lacks the ability to calculate the
>> averages of a set or deque, for example. Instead, it only handles tuples and
>> lists. That reduces the usefulness of your implementation.
>
> Thank you for point out the above. Could you kindly please point out some of
> those unnecessary bugs?

Alan gave a good response here. For example, PyFloat_AsDouble() can fail, 
you need to handle that.

Your code also does not correctly sum up floating point numbers. See 
math.fsum() for that.

Here's a possible Cython version of the variance() function:

   from math import fsum

   def variance(seq):
       "Calculate the variance of all FP numbers in a sequence."
       cdef double value, dsum
       cdef Py_ssize_t count = len(seq)

       dsum = fsum(seq)
       average = dsum / count

       return fsum([(average - value) ** 2 for value in seq]) / count

You can avoid the list comprehension (read: creation) in the second call by 
using a generator expression (remove the angular brackets). However, this 
is likely to run slower (and requires Cython 0.15 ;), so we are trading 
memory for speed here.

If you want this implementation to run faster, you need to unfold and 
inline fsum's algorithm, which you can look up in the CPython sources.

For comparison, here is a less accurate version that does not use fsum() 
but your own algorithm:

   def variance(seq):
       cdef double value, dsum, sqsum
       cdef Py_ssize_t count = len(seq)

       dsum = 0.0
       for value in seq:
           dsum += value
       average = dsum / count

       sqsum = 0.0
       for value in seq:
           sqsum += (average - value) ** 2
       return sqsum / count

Note that both implementations cannot work with iterators as input as they 
iterate twice. If you want to support that, you can add

     seq = list(seq)

at the start, which will let us trade memory for this feature. An 
additional type test condition for seq not being a list or tuple will give 
you more or less what PySequence_Fast() does.

> I'm not sure that it matters that it won't work in sets and deque's, so long
> as the documentation is clear, no?

It will matter to the users of your code at some point.

> (Which I'm still not sure how to do, just yet)

What do you mean? Doc strings? You can just add them to the module function 
struct:

http://docs.python.org/py3k/extending/extending.html#the-module-s-method-table-and-initialization-function

> But, I did test it for sets and deque and it works just fine.
>
> setx = set([])
> print type(setx)
> for i in r:
>      setx.add(random.choice(r))
> print stats.mean(setx)
> dequex = deque([])
> print type(dequex)
> for i in r:
>      dequex.append(random.choice(r))
> print stats.mean(dequex)

Right. I keep forgetting that PySequence_Fast() actually has a fallback 
that copies the iterable into a tuple. So you are also trading memory here 
to make it work for all iterable types. Personally, I'd never use the 
PySequence_Fast*() API because it isn't really all that fast. It's just 
faster than normal Python iteration for tuples and lists and also avoids 
copying in that case. All other cases are worse than generic iteration, and 
it's not faster than Cython's looping code.

> I'll see what I can do about making it work with P3k, I think the only thing
> that would need to be changed would be "PyMODINIT_FUNC initstats(void)" I
> believe. Please correct me if I'm wrong though.

You need to create a module struct and use a different name for the init 
function. Another thing that Cython code doesn't have to care about.

Stefan