[Speed] Performance comparison of regular expression engines

Brett Cannon brett at python.org
Sun Mar 13 13:44:10 EDT 2016


On Sat, 12 Mar 2016 at 10:16 Serhiy Storchaka <storchaka at gmail.com> wrote:

> On 07.03.16 19:19, Brett Cannon wrote:
> > Are you thinking about turning all of this into a benchmark for the
> > benchmark suite?
>
> That was my purpose. I first wrote a benchmark for the benchmark
> suite, then became interested in more detailed results and a
> comparison with alternative engines.
>
> There are several questions about a benchmark for the benchmark suite.
>
> 1. The input data is a public 20 MB text (8 MB in the ZIP file).
> Should we download it every time (maybe with caching) or add it to
> the repository?
>

Add it to the repository, probably (`du -h` on my checkout says the total
disk space used is already 280 MB). I would like to look into what it would
take to use pip to install dependencies so that we don't have such a large
checkout, at which point we could talk about downloading it. But as of
right now we keep everything self-contained to control the inputs to the
benchmarks.
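
If we ever did go the download route, a cached fetch is only a few lines of
stdlib code. A rough sketch (the corpus URL below is a placeholder; the
real location isn't given in this thread):

    import os
    import urllib.request
    import zipfile

    # Placeholder URL -- the real corpus location isn't given here.
    CORPUS_URL = "http://example.com/corpus.zip"
    CACHE_PATH = "corpus.txt"

    def get_corpus():
        """Download the zipped corpus once, then reuse the cached copy."""
        if not os.path.exists(CACHE_PATH):
            zip_path, _ = urllib.request.urlretrieve(CORPUS_URL)
            with zipfile.ZipFile(zip_path) as zf:
                # Assumes the archive contains a single text file.
                member = zf.namelist()[0]
                with open(CACHE_PATH, "wb") as out:
                    out.write(zf.read(member))
        with open(CACHE_PATH, encoding="utf-8") as f:
            return f.read()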


>
> 2. One iteration of all searches on the full text takes 29 seconds
> on my computer. Isn't that too long? In any case, I first want to
> optimize some bottlenecks in the re module.
>

I don't think we have established what counts as "too long". We do have
some benchmarks, like spectral_norm, that only run in rigorous mode, and
this could be one of them.
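
As a point of reference for hunting those re bottlenecks, timing a single
pattern in isolation is cheap. A minimal sketch (the text and pattern here
are stand-ins, not the benchmark's actual inputs):

    import re
    import timeit

    # Stand-in inputs; the real benchmark runs many patterns over 20 MB.
    TEXT = "the quick brown fox jumps over the lazy dog " * 100000
    PATTERN = re.compile(r"\b\w+ck\b")  # arbitrary example pattern

    # number=1 keeps a slow search from dominating a quick check.
    elapsed = timeit.timeit(lambda: PATTERN.findall(TEXT), number=1)
    print("one iteration: %.3f s" % elapsed)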


>
> 3. Do we need one benchmark that gives the accumulated time of all
> searches, or separate microbenchmarks for every pattern?
>

I don't care either way. It depends on whether you want to measure overall
re performance and have people aim to improve that, or let people target
specific workload types.
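
For what it's worth, a harness can report both at once: per-pattern times
plus their sum. A sketch along those lines (the pattern list and corpus
path are illustrative, not the benchmark's real ones):

    import re
    import time

    # Illustrative pattern set; the real benchmark uses many more.
    PATTERNS = [r"Twain", r"[a-z]shing", r"Huck[a-zA-Z]+|Saw[a-zA-Z]+"]
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus
        TEXT = f.read()

    total = 0.0
    for pat in PATTERNS:
        start = time.perf_counter()
        re.findall(pat, TEXT)
        elapsed = time.perf_counter() - start
        total += elapsed
        print("%-35s %.3f s" % (pat, elapsed))
    print("%-35s %.3f s" % ("accumulated", total))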


>
> 4. It would be nice to use the same benchmark for comparing different
> regular expression engines. This requires changing perf.py. Maybe we
> could use the same interface to compare ElementTree with lxml and
> json with simplejson.
>

There's already a way to do this when you execute the benchmark scripts
directly, via command-line flags. You do lose perf.py's calculation
benefits, though. I personally have no issue if you or anyone else comes up
with a way to pass in benchmark-specific flags (i.e., our own version of
-X).
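
As an illustration of what such a flag could select between, the
third-party regex package mirrors re's API closely enough that the engine
can be picked by module name. A sketch (the workload is a stand-in):

    import importlib
    import time

    def run_benchmark(engine_name):
        # "re" is stdlib; "regex" is the third-party package
        # (pip install regex), which mirrors re's compile/findall API.
        engine = importlib.import_module(engine_name)
        text = "spam and eggs " * 100000  # stand-in workload
        pattern = engine.compile(r"\beggs\b")
        start = time.perf_counter()
        pattern.findall(text)
        print("%s: %.3f s" % (engine_name, time.perf_counter() - start))

    run_benchmark("re")
    # run_benchmark("regex")  # only if the third-party package is installed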


>
> 5. The patterns are ASCII-only and the text is mostly ASCII. It would
> be nice to add a non-ASCII pattern and non-ASCII text, but this will
> increase the run time.
>

I think that's fine. Better for the benchmark to measure something useful
than to worry about whether anyone will want to run it in fast mode.
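
To make the suggestion concrete, a non-ASCII case needs nothing special in
Python 3, since str patterns are Unicode-aware by default. A sketch with
made-up Cyrillic inputs:

    import re
    import time

    # Made-up Cyrillic inputs; \b and \w are Unicode-aware for str.
    TEXT = "Лев Толстой написал роман Война и мир. " * 20000
    PATTERN = re.compile(r"\bТолстой\b")

    start = time.perf_counter()
    matches = PATTERN.findall(TEXT)
    print("%d matches in %.3f s"
          % (len(matches), time.perf_counter() - start))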


More information about the Speed mailing list