[pypy-issue] [issue771] Major slowdown vs CPython when using sets

Jay Parlar tracker at bugs.pypy.org
Thu Jun 30 05:01:30 CEST 2011


New submission from Jay Parlar <parlar at gmail.com>:

I've got some code that operates on huge tables from a database. When developing, 
certain areas were taking too long with CPython, so I compiled an Oracle-enabled 
version of PyPy (based on 27df060341f0, which I believe merged in a bunch of dict 
and set improvements), and tried that.

The attached code has three main operations. The first ("Adding nodes to graph"), 
sees an enormous speed-up with PyPy. Yay!

The second ("Adding edges to graph") is roughly the same speed with PyPy and 
CPython. Oh well. It's very quick, so not a huge problem.

The third section ("Building top callers") was massively slower with PyPy with my 
initial implementation of its main function, identify_leaf_nodes(). As a test, I 
created a different implementation of this function that uses lists instead of 
sets as the main data-type. This showed a massive improvement in PyPy, and was 
twice as fast as CPython running this same implementation. I should note that the 
fake data included in this script seems to be even worse on this step than with 
my actual data, as it's been running for over 10 minutes thus far on my machine.

I've attached a very stripped-down version of the code. One big difference is I'm 
simulating a large data set, as I can't give access to my Oracle db. It's close 
enough to how my real data looks, and shows the problem.

Running the script with no arguments will use my original implementation of 
identify_leaf_nodes(), now called identify_leaf_nodes_set_implementation(). 
Passing an argument to the script will cause it to use the list-based 
implementation.

For reference, these are the times I found when running, on a very fast 64-bit 
Linux machine with 32GB of RAM:

---------------------------
PyPy, set implementation
---------------------------
Adding nodes to graph
Completed in 2.57755994797
Adding edges to graph
Completed in 1.20558595657
Building top callers
... It's been at this stage for over 10 minutes on my machine

---------------------------
CPython, set implementation
---------------------------
Adding nodes to graph
Completed in 81.4090509415
Adding edges to graph
Completed in 2.46576595306
Building top callers
Completed in 5.50244784355

---------------------------
PyPy, list implementation
---------------------------
Adding nodes to graph
Completed in 2.82616519928
Adding edges to graph
Completed in 1.51707386971
Building top callers
Completed in 2.70521807671

----------------------------
CPython, list implementation
----------------------------
Adding nodes to graph
Completed in 81.5100169182
Adding edges to graph
Completed in 2.52105903625
Building top callers
Completed in 5.27169704437


In case it helps, the goal of this script is to analyze graphs of caller/callee 
relationships. The end goal is to give it a set of callees, and have it report on 
the set of any callers that call at least one of the callees, as well as return 
the full set of all nodes visited in the search. My actual code does much more 
than this, but this minimal example illustrates the performance problem.

----------
files: graphbuilder.py
messages: 2705
nosy: parlarjb, pypy-issue
priority: bug
status: unread
title: Major slowdown vs CPython when using sets

________________________________________
PyPy bug tracker <tracker at bugs.pypy.org>
<https://bugs.pypy.org/issue771>
________________________________________


More information about the pypy-issue mailing list