From storchaka at gmail.com Sat Mar 5 13:35:22 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sat, 5 Mar 2016 20:35:22 +0200
Subject: [Speed] Performance comparison of regular expression engines
Message-ID: <56DB26EA.3070005@gmail.com>

I have written a benchmark for comparing the different regular expression
engines available in Python. It uses tests and data from [1], which were
themselves inspired by Boost's benchmark [2].

The tested engines are:

* re, the standard regular expression module
* regex, an alternative regular expression module [3]
* re2, a Python wrapper for Google's RE2 [4]
* pcre, Python PCRE bindings [5]

Running the tests on the whole 20MB text file takes too long, so here are
the results for a 2MB chunk (6000000:8000000). The "matches" column is the
number of matches found, and all times are in milliseconds:

pattern                             matches        re     regex       re2      pcre  str.find

Twain                                     5     2.866     2.118     12.47     3.911      2.72
(?i)Twain                                10     84.42     4.366     24.76     17.12
[a-z]shing                              165       125     5.466     27.78     180.6
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52     57.11     72.16     23.87       234
\b\w+nn\b                                32     239.5     427.6     23.18     251.9
[a-q][^u-z]{13}x                        445     381.8     5.537      5843     224.9
Tom|Sawyer|Huckleberry|Finn             314     52.73     58.45     24.39     422.5
(?i)Tom|Sawyer|Huckleberry|Finn         477     445.6     522.1     27.73     415.4
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314     451.2      1113     24.38      1497
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237     450.1      1000      24.3      1549
Tom.{10,25}river|river.{10,25}Tom         1     61.55     58.11     24.97     233.8
[a-zA-Z]+ing                          10079     189.4     350.3     47.41     357.6
\s[a-zA-Z]{0,12}ing\s                  7160     115.7     23.65     37.74     237.6
([A-Za-z]awyer|[A-Za-z]inn)\s            50     153.7     430.4     27.86     425.3
["'][^"']{0,30}[?!\.]["']              1618     83.12     77.39     26.96     157.6

There is no absolute leader; every engine has its weak spots. For re these
are case-insensitive search and searching for a pattern that starts with a
character set.

pcre is very data-sensitive. For another 2MB chunk (8000000:10000000) its
results are 1-2 orders of magnitude slower:

pattern                             matches        re     regex       re2      pcre  str.find

Twain                                    33     2.671     2.209      16.6     413.6      2.75
(?i)Twain                                35     90.21      4.36     27.65     459.4
[a-z]shing                              120     112.7     2.667     30.94      1895
Huck[a-zA-Z]+|Saw[a-zA-Z]+               61     57.12      49.9     26.76      1152
\b\w+nn\b                                33       238     401.4     26.93     763.7
[a-q][^u-z]{13}x                        481     387.7     5.694      5915      6979
Tom|Sawyer|Huckleberry|Finn             845     52.89     59.61     28.42 1.228e+04
(?i)Tom|Sawyer|Huckleberry|Finn         988     452.3     523.4     32.15 1.426e+04
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     845     421.1      1105     29.01 1.343e+04
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     625     398.6     985.6     29.19      9878
Tom.{10,25}river|river.{10,25}Tom         1      61.6     58.33     26.59     254.1
[a-zA-Z]+ing                          10109     194.5     349.7     50.85 1.445e+05
\s[a-zA-Z]{0,12}ing\s                  7286     120.1     23.73     42.04 1.051e+05
([A-Za-z]awyer|[A-Za-z]inn)\s            43     170.6     402.9     30.84      1119
["'][^"']{0,30}[?!\.]["']              1686      86.5     110.2     30.62 2.369e+04

[1] http://sljit.sourceforge.net/regex_perf.html
[2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html
[3] https://pypi.python.org/pypi/regex/2016.03.02
[4] https://pypi.python.org/pypi/re2/0.2.22
[5] https://pypi.python.org/pypi/python-pcre/0.7
-------------- next part --------------
A non-text attachment was scrubbed...
Name: regex_bench.py
Type: text/x-python
Size: 3405 bytes
Desc: not available
URL:
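[The regex_bench.py attachment was not preserved here. As a rough
illustration only, a minimal harness along the lines described above might
look like the following; the input file name, chunk bounds and best-of-5
timing policy are assumptions, not a reconstruction of the attached
script.]

    import re
    from timeit import default_timer as timer

    def bench(engine, pattern, text, repeat=5):
        # Compile outside the timed region; report the best of `repeat` runs.
        compiled = engine.compile(pattern)
        best = None
        for _ in range(repeat):
            start = timer()
            count = len(compiled.findall(text))
            elapsed = (timer() - start) * 1000.0  # milliseconds
            best = elapsed if best is None else min(best, elapsed)
        return count, best

    if __name__ == '__main__':
        with open('mtent12.txt') as f:          # hypothetical file name
            text = f.read()[6000000:8000000]    # the 2MB chunk used above
        for pat in [r'Twain', r'(?i)Twain', r'[a-z]shing', r'\b\w+nn\b']:
            count, ms = bench(re, pat, text)
            print('%-36s %7d %10.3f' % (pat, count, ms))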
From fijall at gmail.com Sun Mar 6 02:14:03 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Sun, 6 Mar 2016 09:14:03 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To: <56DB26EA.3070005@gmail.com>
References: <56DB26EA.3070005@gmail.com>
Message-ID:

Hi Serhiy,

Any chance you can rerun this on PyPy?

On Sat, Mar 5, 2016 at 8:35 PM, Serhiy Storchaka wrote:
> I have written a benchmark for comparing the different regular expression
> engines available in Python. It uses tests and data from [1], which were
> themselves inspired by Boost's benchmark [2]. [...]

From storchaka at gmail.com Sun Mar 6 04:21:56 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sun, 6 Mar 2016 11:21:56 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On 06.03.16 09:14, Maciej Fijalkowski wrote:
> Any chance you can rerun this on PyPy?
Results on PyPy 2.2.1 (I'm not sure I can build the latest PyPy on my
computer):

pattern                             matches        re  str.find

Twain                                     5     5.469     3.852
(?i)Twain                                10     8.646
[a-z]shing                              165     17.24
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52     7.763
\b\w+nn\b                                32       101
[a-q][^u-z]{13}x                        445     167.6
Tom|Sawyer|Huckleberry|Finn             314     8.583
(?i)Tom|Sawyer|Huckleberry|Finn         477      16.3
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314     270.9
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237       262
Tom.{10,25}river|river.{10,25}Tom         1     8.461
[a-zA-Z]+ing                          10079       348
\s[a-zA-Z]{0,12}ing\s                  7160     115.8
([A-Za-z]awyer|[A-Za-z]inn)\s            50     16.62
["'][^"']{0,30}[?!\.]["']              1618     14.45

The alternative regular expression engines need extension modules and
don't work on PyPy for me.

For comparison, results on CPython 2.7.11+:

pattern                             matches        re     regex       re2      pcre  str.find

Twain                                     5     4.423     2.699     8.045      93.4     4.181
(?i)Twain                                10     50.07     3.563     20.35     185.6
[a-z]shing                              165     98.68     6.365     23.71      2886
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52     58.97     50.26     19.52      1016
\b\w+nn\b                                32     130.1     416.5     18.38     740.7
[a-q][^u-z]{13}x                        445     406.6     7.935      5886      7137
Tom|Sawyer|Huckleberry|Finn             314     53.09      59.1     20.33      5377
(?i)Tom|Sawyer|Huckleberry|Finn         477     281.2     338.5     23.77      7895
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314     419.5      1142     20.69      6423
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237     410.9      1013     18.99      5224
Tom.{10,25}river|river.{10,25}Tom         1     63.17     58.31     18.94     260.2
[a-zA-Z]+ing                          10079     203.8     363.8     43.78 1.583e+05
\s[a-zA-Z]{0,12}ing\s                  7160     127.1     26.65     34.23 1.114e+05
([A-Za-z]awyer|[A-Za-z]inn)\s            50     147.6     412.4     21.57      1172
["'][^"']{0,30}[?!\.]["']              1618     85.88     86.55     22.22 2.576e+04

And on Jython 2.5.3 with JRE 7:

pattern                             matches        re  str.find

Twain                                     5        34         3
(?i)Twain                                10       251
[a-z]shing                              165       564
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52       281
\b\w+nn\b                                32       510
[a-q][^u-z]{13}x                        445      1786
Tom|Sawyer|Huckleberry|Finn             314       102
(?i)Tom|Sawyer|Huckleberry|Finn         477      1232
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314      1345
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237      1353
Tom.{10,25}river|river.{10,25}Tom         1       305
[a-zA-Z]+ing                          10079      1211
\s[a-zA-Z]{0,12}ing\s                  7160       571
([A-Za-z]awyer|[A-Za-z]inn)\s            50       676
["'][^"']{0,30}[?!\.]["']              1618       431
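[As the message above notes, the alternative engines are extension modules
that may be unavailable on a given interpreter. A sketch of how a
cross-implementation run could probe the imports and benchmark whichever
engines are present; the module names are assumed to match the PyPI
packages listed in the first message:]

    import re

    # Start with the stdlib engine, then probe the optional ones; each is
    # an extension module that may be missing (e.g. on PyPy or Jython).
    ENGINES = {'re': re}
    for name in ('regex', 're2', 'pcre'):
        try:
            ENGINES[name] = __import__(name)
        except ImportError:
            pass  # skip engines that are not installed on this interpreter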
From fijall at gmail.com Sun Mar 6 04:30:15 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Sun, 6 Mar 2016 11:30:15 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

This is really difficult to read. Can you tell me which column I am
looking at?

On Sun, Mar 6, 2016 at 11:21 AM, Serhiy Storchaka wrote:
> Results on PyPy 2.2.1 (I'm not sure I can build the latest PyPy on my
> computer): [...]

From storchaka at gmail.com Sun Mar 6 10:03:06 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sun, 6 Mar 2016 17:03:06 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On 06.03.16 11:30, Maciej Fijalkowski wrote:
> This is really difficult to read. Can you tell me which column I am
> looking at?

The first column is the pattern being searched for. The second column is
the number of matches found (as a check, it should be the same for all
engines and versions). The third column, under the "re" header, is the
time in milliseconds. The column under the "str.find" header is the time
of searching without regular expressions.

PyPy 2.2 is usually significantly faster than CPython 2.7, except when
searching for a plain string with a regular expression. But thanks to the
Flexible String Representation, searching for a plain string both with and
without a regular expression is faster on CPython 3.6.

From brett at python.org Mon Mar 7 12:19:25 2016
From: brett at python.org (Brett Cannon)
Date: Mon, 07 Mar 2016 17:19:25 +0000
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To: <56DB26EA.3070005@gmail.com>
References: <56DB26EA.3070005@gmail.com>
Message-ID:

Are you thinking about turning all of this into a benchmark for the
benchmark suite?

On Sat, 5 Mar 2016 at 11:15 Serhiy Storchaka wrote:
> I have written a benchmark for comparing the different regular expression
> engines available in Python. It uses tests and data from [1], which were
> themselves inspired by Boost's benchmark [2]. [...]
From storchaka at gmail.com Sat Mar 12 13:16:24 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sat, 12 Mar 2016 20:16:24 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On 07.03.16 19:19, Brett Cannon wrote:
> Are you thinking about turning all of this into a benchmark for the
> benchmark suite?

This was my purpose. I first wrote a benchmark for the benchmark suite,
then became interested in more detailed results and in a comparison with
the alternative engines.

There are several questions about a benchmark for the benchmark suite.

1. The input data is a public 20MB text (8MB in a ZIP file). Should we
download it every time, maybe with caching (as in the sketch after this
message), or add it to the repository?

2. One iteration of all the searches on the full text takes 29 seconds on
my computer. Isn't this too long? In any case I first want to optimize
some bottlenecks in the re module.

3. Do we need one benchmark that gives the accumulated time of all the
searches, or separate microbenchmarks for every pattern?

4. It would be nice to use the same benchmark for comparing different
regular expression engines. This requires changing perf.py. Maybe we could
use the same interface to compare ElementTree with lxml and json with
simplejson.

5. The patterns are ASCII-only and the text is mostly ASCII. It would be
nice to add non-ASCII patterns and non-ASCII text, but this will increase
the run time.
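[A download-with-caching helper for question 1 could be quite small. A
sketch; the URL and cache location are placeholders, not a settled
decision:]

    import os
    try:
        from urllib.request import urlretrieve  # Python 3
    except ImportError:
        from urllib import urlretrieve  # Python 2

    DATA_URL = 'http://example.com/mtent12.zip'  # placeholder URL
    CACHE_DIR = os.path.expanduser('~/.pybench-cache')

    def get_data_file():
        # Download the input data on first use; reuse the cached copy
        # on subsequent runs.
        if not os.path.isdir(CACHE_DIR):
            os.makedirs(CACHE_DIR)
        path = os.path.join(CACHE_DIR, os.path.basename(DATA_URL))
        if not os.path.exists(path):
            urlretrieve(DATA_URL, path)
        return path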
From brett at python.org Sun Mar 13 13:44:10 2016
From: brett at python.org (Brett Cannon)
Date: Sun, 13 Mar 2016 17:44:10 +0000
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On Sat, 12 Mar 2016 at 10:16 Serhiy Storchaka wrote:

> There are several questions about a benchmark for the benchmark suite.
>
> 1. The input data is a public 20MB text (8MB in a ZIP file). Should we
> download it every time, maybe with caching, or add it to the repository?

Add it to the repository, probably (`du -h` on my checkout says the total
disk space used is 280 MB already). I would like to look into what it
would take to use pip to install dependencies so that we don't have such a
large checkout, at which point we could talk about downloading it. But as
of right now we keep it all self-contained to control for the inputs to
the benchmarks.

> 2. One iteration of all the searches on the full text takes 29 seconds
> on my computer. Isn't this too long? In any case I first want to
> optimize some bottlenecks in the re module.

I don't think we have established a "too long" time. We do have some
benchmarks like spectral_norm that don't run unless you use rigorous mode,
and this could be one of them.

> 3. Do we need one benchmark that gives the accumulated time of all the
> searches, or separate microbenchmarks for every pattern?

I don't care either way. Obviously it depends on whether you want to
measure overall re perf and have people aim to improve that, or let people
target specific workload types.

> 4. It would be nice to use the same benchmark for comparing different
> regular expression engines. This requires changing perf.py. Maybe we
> could use the same interface to compare ElementTree with lxml and json
> with simplejson.

So there's already an approach to do this when you execute the benchmark
scripts directly through command-line flags. You do lose perf.py's
calculation benefits, though. I personally have no issue if you or anyone
else comes up with a way to pass in benchmark-specific flags (i.e., our
own version of -X).

> 5. The patterns are ASCII-only and the text is mostly ASCII. It would be
> nice to add non-ASCII patterns and non-ASCII text, but this will
> increase the run time.

I think that's fine. Better that the benchmark measure something useful
than worry about whether anyone will want to run it in fast mode.
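[A benchmark-specific flag of the kind described here could let a single
script drive all four engines through the same interface. A sketch only;
the flag name and engine list are assumptions:]

    import argparse
    import importlib

    parser = argparse.ArgumentParser(description='regex engine benchmark')
    parser.add_argument('--engine', default='re',
                        choices=['re', 'regex', 're2', 'pcre'],
                        help='regular expression module to benchmark')
    args = parser.parse_args()

    # All four modules expose a compile() callable, so the chosen engine
    # can be imported by name and used interchangeably.
    engine = importlib.import_module(args.engine)
    compiled = engine.compile(r'\b\w+nn\b')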
From solipsis at pitrou.net Mon Mar 14 10:27:08 2016
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 14 Mar 2016 15:27:08 +0100
Subject: [Speed] Performance comparison of regular expression engines
References: <56DB26EA.3070005@gmail.com>
Message-ID: <20160314152708.025586f1@fsol>

On Sun, 13 Mar 2016 17:44:10 +0000
Brett Cannon wrote:
> > 2. One iteration of all the searches on the full text takes 29 seconds
> > on my computer. Isn't this too long? In any case I first want to
> > optimize some bottlenecks in the re module.
>
> I don't think we have established a "too long" time. We do have some
> benchmarks like spectral_norm that don't run unless you use rigorous
> mode, and this could be one of them.
>
> > 3. Do we need one benchmark that gives the accumulated time of all the
> > searches, or separate microbenchmarks for every pattern?
>
> I don't care either way. Obviously it depends on whether you want to
> measure overall re perf and have people aim to improve that, or let
> people target specific workload types.

This is a more general latent issue with our current benchmarking
philosophy. We have built something which aims to be a general-purpose
benchmark suite, but in some domains a more comprehensive set of
benchmarks may be desirable. Obviously we don't want to have 10 JSON
benchmarks, 10 re benchmarks, 10 I/O benchmarks, etc. in the default
benchmarks run, so what do we do for such cases? Do we tell people
domain-specific benchmarks should be developed independently? Do we
include some facilities to create such subsuites without them being part
of the default bunch?
(note a couple of domain-specific benchmarks -- iobench, stringbench,
etc. -- are currently maintained separately)

Regards

Antoine.

From brett at python.org Mon Mar 14 11:40:14 2016
From: brett at python.org (Brett Cannon)
Date: Mon, 14 Mar 2016 15:40:14 +0000
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To: <20160314152708.025586f1@fsol>
References: <56DB26EA.3070005@gmail.com> <20160314152708.025586f1@fsol>
Message-ID:

On Mon, 14 Mar 2016 at 07:27 Antoine Pitrou wrote:
> This is a more general latent issue with our current benchmarking
> philosophy. We have built something which aims to be a general-purpose
> benchmark suite, but in some domains a more comprehensive set of
> benchmarks may be desirable. Obviously we don't want to have 10 JSON
> benchmarks, 10 re benchmarks, 10 I/O benchmarks, etc. in the default
> benchmarks run, so what do we do for such cases? Do we tell people
> domain-specific benchmarks should be developed independently? Do we
> include some facilities to create such subsuites without them being
> part of the default bunch?
>
> (note a couple of domain-specific benchmarks -- iobench, stringbench,
> etc. -- are currently maintained separately)

Good point. I personally don't have a good feel on how to handle this.
Part of me would like to consolidate the benchmarks so that it's easier
to discover what benchmarks there are. Another part of me doesn't want to
burden folks writing their own benchmarks for development purposes too
much.

From mount.sarah at gmail.com Thu Mar 24 07:05:52 2016
From: mount.sarah at gmail.com (Sarah Mount)
Date: Thu, 24 Mar 2016 11:05:52 +0000
Subject: [Speed] Software benchmarking workshop, April 20, King's College London
Message-ID:

Members of this list who are in the UK during April may be interested in
this free workshop in central London. If you have any questions, please
feel free to email me directly.

Best Practices in Software Benchmarking 2016 (#bench16)
Wednesday April 20 2016
King's College London
http://soft-dev.org/events/bench16/

For computer scientists and software engineers, benchmarking (evaluating
the running time of a piece of software, or the performance of a piece of
hardware) is a common method for evaluating new techniques. However, there
is little agreement on how benchmarking should be carried out, how to
control for confounding variables, how to analyse latency data, or how to
aid the repeatability of experiments. This free workshop will be a venue
for computer scientists and research software engineers to discuss their
current best practices and future directions.

For further information and free registration please visit:
http://soft-dev.org/events/bench16/

Confirmed Speakers:

Jan Vitek (Northeastern University)
Joe Parker (The Jodrell Laboratory, Royal Botanic Gardens)
Simon Taylor (University of Lancaster)
Tomas Kalibera (Northeastern University)
James Davenport (University of Bath)
Edd Barrett (King's College London)
Jeremy Bennett (Embecosm)

Organizers: Sarah Mount & Laurence Tratt (King's College London)