[Python-Dev] Yet another "A better story for multi-core Python" comment

Trent Nelson trent at snakebite.org
Wed Sep 9 22:33:49 CEST 2015


On Tue, Sep 08, 2015 at 10:12:37AM -0400, Gary Robinson wrote:
> There was a huge data structure that all the analysis needed to
> access. Using a database would have slowed things down too much.
> Ideally, I needed to access this same structure from many cores at
> once. On a Power8 system, for example, with its larger number of
> cores, performance may well have been good enough for production. In
> any case, my experimentation and prototyping would have gone more
> quickly with more cores.
>
> But this data structure was simply too big. Replicating it in
> different processes used memory far too quickly and was the limiting
> factor on the number of cores I could use. (I could fork with the big
> data structure already in memory, but copy-on-write issues due to
> reference counting caused multiple copies to exist anyway.)
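
(A minimal sketch of the copy-on-write point above, assuming stock
CPython; the function name refcount_delta is made up for illustration.
Binding one more name to an object changes no data, but it does update
the reference count stored in the object's header. After a fork, that
write is enough to make the kernel copy the page the object lives on.)

```python
import sys


def refcount_delta():
    """Return how much an object's refcount grows when one more name refers to it."""
    payload = object()
    before = sys.getrefcount(payload)
    alias = payload  # no data is modified, but the refcount in the header is
    after = sys.getrefcount(payload)
    return after - before


print(refcount_delta())
```

The same header write happens for every object a forked child touches
while walking a big structure, which is how the "shared" pages end up
duplicated per process.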

This problem is *exactly* the type of thing that PyParallel excels at,
just FYI.  PyParallel can load large, complex data structures now, and
then access them freely from within multiple threads.  I'd recommend
taking a look at the "instantaneous Wikipedia search server" example as
a start:

https://github.com/pyparallel/pyparallel/blob/branches/3.3-px/examples/wiki/wiki.py

That loads a trie with 27 million entries, creates ~27.1 million
PyObjects, loads a huge NumPy array, and has a WSS of ~11GB.  I've
actually got a new version in development that loads 6 tries of the
most frequent terms for character lengths 1-6.  Once everything is
loaded, the data structures can be accessed for free in parallel
threads.
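
The access pattern (load once, then read from many threads) looks
roughly like the sketch below.  This is plain CPython using
concurrent.futures, not PyParallel's API, and the names index, lookup
and parallel_lookups are made up for illustration.  In stock CPython
the GIL still serializes the bytecode, so it only shows the
no-copy sharing, not the parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a large, load-once structure (e.g. a title index).
index = {"title-%d" % i: i for i in range(100_000)}


def lookup(key):
    # Every worker thread reads the same in-process object; nothing is copied.
    return index.get(key)


def parallel_lookups(keys, workers=4):
    # pool.map returns results in the same order as the input keys.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup, keys))


results = parallel_lookups(["title-0", "title-42", "missing"])
print(results)
```

Because the threads share one address space, memory use stays flat no
matter how many workers read the structure, unlike the fork-per-core
approach.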

There are more details regarding how this is achieved on the landing
page:

https://github.com/pyparallel/pyparallel

I've done a couple of consultancy projects now that were very data
science oriented (with huge data sets), so I really gained an
appreciation for how common the situation you describe is.  It is
probably the best demonstration of PyParallel's strengths.

> Gary Robinson garyrob at me.com http://www.garyrobinson.net

    Trent.

