[pypy-dev] Program slower on Pypy 7.3.3 (3.7.9) than CPython 3.9.

Tue Mar 16 05:27:20 EDT 2021

On 3/15/21 11:16 PM, Dan Stromberg wrote:
>
> And it's opensource, though many of the inputs are licensed.
>
> The code is at https://stromberg.dnsalias.org/~strombrg/music-pipeline/
> (https://stromberg.dnsalias.org/svn/music-pipeline/trunk/)
>
> It appears to be more than 10x slower.
>
> I haven't profiled it yet.  I believe it's probably the "Blocklisting
> files..." part that's slow.  That part is O(n*m), and seems to take
> forever.  It's heavy on regular expressions.
>
> Are regular expressions expected to be slow on Pypy3?

Hi Dan,

Interesting problem! Single regular expressions are reasonably fast on
PyPy, since they are jitted. But I don't think we have looked into the
problem of "what if you have thousands of them" before. Your reproducer
is hitting a known, hard-to-fix corner case of the JIT: in this case it
actually produces a linear search over the existing regular expressions
for every match call, with catastrophic consequences. Working on this
problem is in my mid-term plans, but it won't happen next week.
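
For reference, here is a minimal sketch of the shape of loop I mean. It
is not your actual code: the patterns, the filenames, and the helper
name count_blocklisted are all made up for illustration, and I'm only
assuming that your reproducer does something of this form (every
filename tried against every compiled regex).

```python
import re

# Hypothetical stand-ins for the real blocklist: about 2,000 distinct patterns.
patterns = [rf"track-{n:04d}-.*\.flac$" for n in range(2_000)]
compiled = [re.compile(p) for p in patterns]


def count_blocklisted(filenames, regexes):
    """Count the filenames matched by at least one regex.

    This is the O(n*m) shape: each filename is tried against each
    regex in turn, so there are many match calls on many different
    compiled patterns.
    """
    matches = 0
    for filename in filenames:
        for regex in regexes:
            if regex.match(filename):
                matches += 1
                break  # one hit is enough to blocklist this file
    return matches


sample = ["track-0007-intro.flac", "other.mp3", "track-1999-x.flac"]
print(count_blocklisted(sample, compiled))
```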

Here's a fun workaround that improves the performance of both CPython
(by about 2x for me) and PyPy (by 10x or so): turn the many regular
expressions into a single one:

     regex_strings = [f"(?:{one_regex()})" for repno in range(2_046)]
     regex_compiled = re.compile("|".join(regex_strings))

Then replace the per-regex match calls with a single one:

     for filename in filenames:
         if regex_compiled.match(filename):
             matches += 1

I believe you could try the same approach in your full program.
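
Before switching over, it may be worth checking on a sample that the
combined regex accepts the same filenames as the separate ones. A small
sketch of such a check follows; the patterns, the filenames, and the
helper name combine are hypothetical, not taken from your program.

```python
import re


def combine(patterns):
    """Join patterns into a single alternation.

    Each pattern is wrapped in a non-capturing group (?:...) so that
    a top-level "|" inside one pattern cannot leak into its neighbours.
    """
    return re.compile("|".join(f"(?:{p})" for p in patterns))


# Hypothetical patterns and filenames, for illustration only.
patterns = [r".*\.tmp$", r"backup-\d+|old-\d+", r".*~$"]
filenames = ["notes.txt", "backup-42.tar", "old-7.zip", "draft.txt~", "cache.tmp"]

separate = [re.compile(p) for p in patterns]
combined = combine(patterns)

# A filename counts as a hit if any separate regex matches it ...
hits_separate = [f for f in filenames if any(r.match(f) for r in separate)]
# ... which should agree with a single match call on the combined regex.
hits_combined = [f for f in filenames if combined.match(f)]

assert hits_separate == hits_combined
print(hits_combined)
```

One caveat: this assumes the individual patterns are independent. If
any of them use numbered backreferences or their own flags, joining
them changes the group numbering, so those would need extra care.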

Cheers,

Carl Friedrich