Help me cythonize a python routine!

Wed Nov 9 18:01:29 EST 2016

On 09/11/2016 21:25, breamoreboy at gmail.com wrote:
> On Wednesday, November 9, 2016 at 7:34:41 PM UTC, BartC wrote:

>> All the real work is done inside the Collections module. If that was
>> written in Python, then you'd have to Cythonise that, and there might be
>> quite a lot of it!
>>
>> But I think 'collections' is a built-in module which means it's already
>> in something like C. So it might already be as fast as it gets (0.7 to
>> 0.8 seconds on my machine), unless perhaps a different algorithm is used.
>
> 'collections' isn't a built-in module, it's part of the stdlib, again showing your complete ignorance of the entire Python setup.  However according to your mindset nothing matters provided it's fast, accuracy does not matter to users.  Hence your recent comment on another thread about converting invalid integer entries into zero.  I weep every time I think about it, which is quite often.

I haven't ruled out that collections is written in Python. But I can't 
find a 'collections.py' module in my Python 3.4; the nearest is 
"__init__.py". And there /is/ a lot of code there.

To the OP: there was a long thread in comp.lang.c about tasks like this 
in a variety of languages:

https://groups.google.com/forum/#!topic/comp.lang.c/n2gvRytNblI%5B326-350%5D

The first solution that corresponds exactly to your task (In Python but 
with a different method) is at the bottom of page 14; 19-Dec-15.

I posted part of a version in compiled code that runs in about 70msec 
(the fastest PyPy was 600msec). So some speed-up /is/ possible with the 
right approach. The method I used was something like this:

  - Read the entire file into memory (eg. as string or byte-array)

  - Initialise a hashtable D (ie. dict) to empty, where each entry
    is a word, and a count

  - Scan that memory data recognising words

  - For each word, look it up in D. Increment its count if found;
    add it with a count of 1 if not

  - At the end, rather than sort, just scan D 20 times looking for
    the word with the nth highest count (n is 1 to 20), and display.

This can be written in pure Python for the algorithm part, without 
needing to use 'collections', only dicts. That could offer an easier 
opportunity to apply Cython tweaks.

-- 
Bartc