From storchaka at gmail.com Sat Mar 5 13:35:22 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sat, 5 Mar 2016 20:35:22 +0200
Subject: [Speed] Performance comparison of regular expression engines
Message-ID: <56DB26EA.3070005@gmail.com>

I have written a benchmark for comparing the different regular expression
engines available in Python. It uses tests and data from [1], which were
themselves inspired by Boost's benchmark [2].

The tested engines are:

* re, the standard regular expression module
* regex, an alternative regular expression module [3]
* re2, a Python wrapper for Google's RE2 [4]
* pcre, Python PCRE bindings [5]

Running the tests on the whole 20MB text file takes too long, so here are
the results for a 2MB chunk (6000000:8000000). The "matches" column is the
number of matches found, and all times are in milliseconds:

pattern                             matches        re     regex       re2      pcre  str.find

Twain                                     5     2.866     2.118     12.47     3.911      2.72
(?i)Twain                                10     84.42     4.366     24.76     17.12
[a-z]shing                              165       125     5.466     27.78     180.6
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52     57.11     72.16     23.87       234
\b\w+nn\b                                32     239.5     427.6     23.18     251.9
[a-q][^u-z]{13}x                        445     381.8     5.537      5843     224.9
Tom|Sawyer|Huckleberry|Finn             314     52.73     58.45     24.39     422.5
(?i)Tom|Sawyer|Huckleberry|Finn         477     445.6     522.1     27.73     415.4
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314     451.2      1113     24.38      1497
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237     450.1      1000      24.3      1549
Tom.{10,25}river|river.{10,25}Tom         1     61.55     58.11     24.97     233.8
[a-zA-Z]+ing                          10079     189.4     350.3     47.41     357.6
\s[a-zA-Z]{0,12}ing\s                  7160     115.7     23.65     37.74     237.6
([A-Za-z]awyer|[A-Za-z]inn)\s            50     153.7     430.4     27.86     425.3
["'][^"']{0,30}[?!\.]["']              1618     83.12     77.39     26.96     157.6

There is no absolute leader; every engine has its weak spots. For re these
are case-insensitive search and searching for a pattern that starts with a
character set.

pcre is very data-sensitive. For another 2MB chunk (8000000:10000000) its
results are 1-2 orders of magnitude slower:

pattern                             matches        re     regex       re2      pcre  str.find

Twain                                    33     2.671     2.209      16.6     413.6      2.75
(?i)Twain                                35     90.21      4.36     27.65     459.4
[a-z]shing                              120     112.7     2.667     30.94      1895
Huck[a-zA-Z]+|Saw[a-zA-Z]+               61     57.12      49.9     26.76      1152
\b\w+nn\b                                33       238     401.4     26.93     763.7
[a-q][^u-z]{13}x                        481     387.7     5.694      5915      6979
Tom|Sawyer|Huckleberry|Finn             845     52.89     59.61     28.42 1.228e+04
(?i)Tom|Sawyer|Huckleberry|Finn         988     452.3     523.4     32.15 1.426e+04
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     845     421.1      1105     29.01 1.343e+04
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     625     398.6     985.6     29.19      9878
Tom.{10,25}river|river.{10,25}Tom         1      61.6     58.33     26.59     254.1
[a-zA-Z]+ing                          10109     194.5     349.7     50.85 1.445e+05
\s[a-zA-Z]{0,12}ing\s                  7286     120.1     23.73     42.04 1.051e+05
([A-Za-z]awyer|[A-Za-z]inn)\s            43     170.6     402.9     30.84      1119
["'][^"']{0,30}[?!\.]["']              1686      86.5     110.2     30.62 2.369e+04

[1] http://sljit.sourceforge.net/regex_perf.html
[2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html
[3] https://pypi.python.org/pypi/regex/2016.03.02
[4] https://pypi.python.org/pypi/re2/0.2.22
[5] https://pypi.python.org/pypi/python-pcre/0.7
-------------- next part --------------
A non-text attachment was scrubbed...
Name: regex_bench.py
Type: text/x-python
Size: 3405 bytes
Desc: not available
URL:
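[The regex_bench.py attachment was not preserved here. As a rough
illustration only, a minimal harness along the lines described above might
look like the following; the input file name, chunk bounds and best-of-5
timing policy are assumptions, not a reconstruction of the attached
script.]

    import re
    from timeit import default_timer as timer

    def bench(engine, pattern, text, repeat=5):
        # Compile outside the timed region; report the best of `repeat` runs.
        compiled = engine.compile(pattern)
        best = None
        for _ in range(repeat):
            start = timer()
            count = len(compiled.findall(text))
            elapsed = (timer() - start) * 1000.0  # milliseconds
            best = elapsed if best is None else min(best, elapsed)
        return count, best

    if __name__ == '__main__':
        with open('mtent12.txt') as f:          # hypothetical file name
            text = f.read()[6000000:8000000]    # the 2MB chunk used above
        for pat in [r'Twain', r'(?i)Twain', r'[a-z]shing', r'\b\w+nn\b']:
            count, ms = bench(re, pat, text)
            print('%-36s %7d %10.3f' % (pat, count, ms))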
From fijall at gmail.com Sun Mar 6 02:14:03 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Sun, 6 Mar 2016 09:14:03 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To: <56DB26EA.3070005@gmail.com>
References: <56DB26EA.3070005@gmail.com>
Message-ID:

Hi Serhiy,

Any chance you can rerun this on PyPy?

On Sat, Mar 5, 2016 at 8:35 PM, Serhiy Storchaka wrote:
> I have written a benchmark for comparing the different regular expression
> engines available in Python. It uses tests and data from [1], which were
> themselves inspired by Boost's benchmark [2]. [...]

From storchaka at gmail.com Sun Mar 6 04:21:56 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sun, 6 Mar 2016 11:21:56 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On 06.03.16 09:14, Maciej Fijalkowski wrote:
> Any chance you can rerun this on PyPy?
Results on PyPy 2.2.1 (I'm not sure I can build the latest PyPy on my
computer):

pattern                             matches        re  str.find

Twain                                     5     5.469     3.852
(?i)Twain                                10     8.646
[a-z]shing                              165     17.24
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52     7.763
\b\w+nn\b                                32       101
[a-q][^u-z]{13}x                        445     167.6
Tom|Sawyer|Huckleberry|Finn             314     8.583
(?i)Tom|Sawyer|Huckleberry|Finn         477      16.3
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314     270.9
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237       262
Tom.{10,25}river|river.{10,25}Tom         1     8.461
[a-zA-Z]+ing                          10079       348
\s[a-zA-Z]{0,12}ing\s                  7160     115.8
([A-Za-z]awyer|[A-Za-z]inn)\s            50     16.62
["'][^"']{0,30}[?!\.]["']              1618     14.45

The alternative regular expression engines need extension modules and
don't work on PyPy for me.

For comparison, results on CPython 2.7.11+:

pattern                             matches        re     regex       re2      pcre  str.find

Twain                                     5     4.423     2.699     8.045      93.4     4.181
(?i)Twain                                10     50.07     3.563     20.35     185.6
[a-z]shing                              165     98.68     6.365     23.71      2886
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52     58.97     50.26     19.52      1016
\b\w+nn\b                                32     130.1     416.5     18.38     740.7
[a-q][^u-z]{13}x                        445     406.6     7.935      5886      7137
Tom|Sawyer|Huckleberry|Finn             314     53.09      59.1     20.33      5377
(?i)Tom|Sawyer|Huckleberry|Finn         477     281.2     338.5     23.77      7895
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314     419.5      1142     20.69      6423
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237     410.9      1013     18.99      5224
Tom.{10,25}river|river.{10,25}Tom         1     63.17     58.31     18.94     260.2
[a-zA-Z]+ing                          10079     203.8     363.8     43.78 1.583e+05
\s[a-zA-Z]{0,12}ing\s                  7160     127.1     26.65     34.23 1.114e+05
([A-Za-z]awyer|[A-Za-z]inn)\s            50     147.6     412.4     21.57      1172
["'][^"']{0,30}[?!\.]["']              1618     85.88     86.55     22.22 2.576e+04

And on Jython 2.5.3 with JRE 7:

pattern                             matches        re  str.find

Twain                                     5        34         3
(?i)Twain                                10       251
[a-z]shing                              165       564
Huck[a-zA-Z]+|Saw[a-zA-Z]+               52       281
\b\w+nn\b                                32       510
[a-q][^u-z]{13}x                        445      1786
Tom|Sawyer|Huckleberry|Finn             314       102
(?i)Tom|Sawyer|Huckleberry|Finn         477      1232
.{0,2}(Tom|Sawyer|Huckleberry|Finn)     314      1345
.{2,4}(Tom|Sawyer|Huckleberry|Finn)     237      1353
Tom.{10,25}river|river.{10,25}Tom         1       305
[a-zA-Z]+ing                          10079      1211
\s[a-zA-Z]{0,12}ing\s                  7160       571
([A-Za-z]awyer|[A-Za-z]inn)\s            50       676
["'][^"']{0,30}[?!\.]["']              1618       431
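[As the message above notes, the alternative engines are extension modules
that may be unavailable on a given interpreter. A sketch of how a
cross-implementation run could probe the imports and benchmark whichever
engines are present; the module names are assumed to match the PyPI
packages listed in the first message:]

    import re

    # Start with the stdlib engine, then probe the optional ones; each is
    # an extension module that may be missing (e.g. on PyPy or Jython).
    ENGINES = {'re': re}
    for name in ('regex', 're2', 'pcre'):
        try:
            ENGINES[name] = __import__(name)
        except ImportError:
            pass  # skip engines that are not installed on this interpreter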
From fijall at gmail.com Sun Mar 6 04:30:15 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Sun, 6 Mar 2016 11:30:15 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

This is really difficult to read. Can you tell me which column I am
looking at?

On Sun, Mar 6, 2016 at 11:21 AM, Serhiy Storchaka wrote:
> Results on PyPy 2.2.1 (I'm not sure I can build the latest PyPy on my
> computer): [...]

From storchaka at gmail.com Sun Mar 6 10:03:06 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sun, 6 Mar 2016 17:03:06 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On 06.03.16 11:30, Maciej Fijalkowski wrote:
> This is really difficult to read. Can you tell me which column I am
> looking at?

The first column is the pattern being searched for. The second column is
the number of matches found (as a check, it should be the same for all
engines and versions). The third column, under the "re" header, is the
time in milliseconds. The column under the "str.find" header is the time
of searching without regular expressions.

PyPy 2.2 is usually significantly faster than CPython 2.7, except when
searching for a plain string with a regular expression. But thanks to the
Flexible String Representation, searching for a plain string both with and
without a regular expression is faster on CPython 3.6.

From brett at python.org Mon Mar 7 12:19:25 2016
From: brett at python.org (Brett Cannon)
Date: Mon, 07 Mar 2016 17:19:25 +0000
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To: <56DB26EA.3070005@gmail.com>
References: <56DB26EA.3070005@gmail.com>
Message-ID:

Are you thinking about turning all of this into a benchmark for the
benchmark suite?

On Sat, 5 Mar 2016 at 11:15 Serhiy Storchaka wrote:
> I have written a benchmark for comparing the different regular expression
> engines available in Python. It uses tests and data from [1], which were
> themselves inspired by Boost's benchmark [2]. [...]
From storchaka at gmail.com Sat Mar 12 13:16:24 2016
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sat, 12 Mar 2016 20:16:24 +0200
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On 07.03.16 19:19, Brett Cannon wrote:
> Are you thinking about turning all of this into a benchmark for the
> benchmark suite?

This was my purpose. I first wrote a benchmark for the benchmark suite,
then became interested in more detailed results and in a comparison with
the alternative engines.

There are several questions about a benchmark for the benchmark suite.

1. The input data is a public 20MB text (8MB in a ZIP file). Should we
download it every time, maybe with caching (as in the sketch after this
message), or add it to the repository?

2. One iteration of all the searches on the full text takes 29 seconds on
my computer. Isn't this too long? In any case I first want to optimize
some bottlenecks in the re module.

3. Do we need one benchmark that gives the accumulated time of all the
searches, or separate microbenchmarks for every pattern?

4. It would be nice to use the same benchmark for comparing different
regular expression engines. This requires changing perf.py. Maybe we could
use the same interface to compare ElementTree with lxml and json with
simplejson.

5. The patterns are ASCII-only and the text is mostly ASCII. It would be
nice to add non-ASCII patterns and non-ASCII text, but this will increase
the run time.
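[A download-with-caching helper for question 1 could be quite small. A
sketch; the URL and cache location are placeholders, not a settled
decision:]

    import os
    try:
        from urllib.request import urlretrieve  # Python 3
    except ImportError:
        from urllib import urlretrieve  # Python 2

    DATA_URL = 'http://example.com/mtent12.zip'  # placeholder URL
    CACHE_DIR = os.path.expanduser('~/.pybench-cache')

    def get_data_file():
        # Download the input data on first use; reuse the cached copy
        # on subsequent runs.
        if not os.path.isdir(CACHE_DIR):
            os.makedirs(CACHE_DIR)
        path = os.path.join(CACHE_DIR, os.path.basename(DATA_URL))
        if not os.path.exists(path):
            urlretrieve(DATA_URL, path)
        return path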
From brett at python.org Sun Mar 13 13:44:10 2016
From: brett at python.org (Brett Cannon)
Date: Sun, 13 Mar 2016 17:44:10 +0000
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To:
References: <56DB26EA.3070005@gmail.com>
Message-ID:

On Sat, 12 Mar 2016 at 10:16 Serhiy Storchaka wrote:

> There are several questions about a benchmark for the benchmark suite.
>
> 1. The input data is a public 20MB text (8MB in a ZIP file). Should we
> download it every time, maybe with caching, or add it to the repository?

Add it to the repository, probably (`du -h` on my checkout says the total
disk space used is 280 MB already). I would like to look into what it
would take to use pip to install dependencies so that we don't have such a
large checkout, at which point we could talk about downloading it. But as
of right now we keep it all self-contained to control for the inputs to
the benchmarks.

> 2. One iteration of all the searches on the full text takes 29 seconds
> on my computer. Isn't this too long? In any case I first want to
> optimize some bottlenecks in the re module.

I don't think we have established a "too long" time. We do have some
benchmarks like spectral_norm that don't run unless you use rigorous mode,
and this could be one of them.

> 3. Do we need one benchmark that gives the accumulated time of all the
> searches, or separate microbenchmarks for every pattern?

I don't care either way. Obviously it depends on whether you want to
measure overall re perf and have people aim to improve that, or let people
target specific workload types.

> 4. It would be nice to use the same benchmark for comparing different
> regular expression engines. This requires changing perf.py. Maybe we
> could use the same interface to compare ElementTree with lxml and json
> with simplejson.

So there's already an approach to do this when you execute the benchmark
scripts directly through command-line flags. You do lose perf.py's
calculation benefits, though. I personally have no issue if you or anyone
else comes up with a way to pass in benchmark-specific flags (i.e., our
own version of -X).

> 5. The patterns are ASCII-only and the text is mostly ASCII. It would be
> nice to add non-ASCII patterns and non-ASCII text, but this will
> increase the run time.

I think that's fine. Better that the benchmark measure something useful
than worry about whether anyone will want to run it in fast mode.
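[A benchmark-specific flag of the kind described here could let a single
script drive all four engines through the same interface. A sketch only;
the flag name and engine list are assumptions:]

    import argparse
    import importlib

    parser = argparse.ArgumentParser(description='regex engine benchmark')
    parser.add_argument('--engine', default='re',
                        choices=['re', 'regex', 're2', 'pcre'],
                        help='regular expression module to benchmark')
    args = parser.parse_args()

    # All four modules expose a compile() callable, so the chosen engine
    # can be imported by name and used interchangeably.
    engine = importlib.import_module(args.engine)
    compiled = engine.compile(r'\b\w+nn\b')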
From solipsis at pitrou.net Mon Mar 14 10:27:08 2016
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 14 Mar 2016 15:27:08 +0100
Subject: [Speed] Performance comparison of regular expression engines
References: <56DB26EA.3070005@gmail.com>
Message-ID: <20160314152708.025586f1@fsol>

On Sun, 13 Mar 2016 17:44:10 +0000
Brett Cannon wrote:
> > 2. One iteration of all the searches on the full text takes 29 seconds
> > on my computer. Isn't this too long? In any case I first want to
> > optimize some bottlenecks in the re module.
>
> I don't think we have established a "too long" time. We do have some
> benchmarks like spectral_norm that don't run unless you use rigorous
> mode, and this could be one of them.
>
> > 3. Do we need one benchmark that gives the accumulated time of all the
> > searches, or separate microbenchmarks for every pattern?
>
> I don't care either way. Obviously it depends on whether you want to
> measure overall re perf and have people aim to improve that, or let
> people target specific workload types.

This is a more general latent issue with our current benchmarking
philosophy. We have built something which aims to be a general-purpose
benchmark suite, but in some domains a more comprehensive set of
benchmarks may be desirable. Obviously we don't want to have 10 JSON
benchmarks, 10 re benchmarks, 10 I/O benchmarks, etc. in the default
benchmarks run, so what do we do for such cases? Do we tell people
domain-specific benchmarks should be developed independently? Do we
include some facilities to create such subsuites without them being part
of the default bunch?
(note a couple of domain-specific benchmarks -- iobench, stringbench,
etc. -- are currently maintained separately)

Regards

Antoine.

From brett at python.org Mon Mar 14 11:40:14 2016
From: brett at python.org (Brett Cannon)
Date: Mon, 14 Mar 2016 15:40:14 +0000
Subject: [Speed] Performance comparison of regular expression engines
In-Reply-To: <20160314152708.025586f1@fsol>
References: <56DB26EA.3070005@gmail.com> <20160314152708.025586f1@fsol>
Message-ID:

On Mon, 14 Mar 2016 at 07:27 Antoine Pitrou wrote:
> This is a more general latent issue with our current benchmarking
> philosophy. We have built something which aims to be a general-purpose
> benchmark suite, but in some domains a more comprehensive set of
> benchmarks may be desirable. Obviously we don't want to have 10 JSON
> benchmarks, 10 re benchmarks, 10 I/O benchmarks, etc. in the default
> benchmarks run, so what do we do for such cases? Do we tell people
> domain-specific benchmarks should be developed independently? Do we
> include some facilities to create such subsuites without them being
> part of the default bunch?
>
> (note a couple of domain-specific benchmarks -- iobench, stringbench,
> etc. -- are currently maintained separately)

Good point. I personally don't have a good feel on how to handle this.
Part of me would like to consolidate the benchmarks so that it's easier
to discover what benchmarks there are. Another part of me doesn't want to
burden folks writing their own benchmarks for development purposes too
much.

From mount.sarah at gmail.com Thu Mar 24 07:05:52 2016
From: mount.sarah at gmail.com (Sarah Mount)
Date: Thu, 24 Mar 2016 11:05:52 +0000
Subject: [Speed] Software benchmarking workshop, April 20, King's College London
Message-ID:

Members of this list who are in the UK during April may be interested in
this free workshop in central London. If you have any questions, please
feel free to email me directly.

Best Practices in Software Benchmarking 2016 (#bench16)
Wednesday April 20 2016
King's College London
http://soft-dev.org/events/bench16/

For computer scientists and software engineers, benchmarking (evaluating
the running time of a piece of software, or the performance of a piece of
hardware) is a common method for evaluating new techniques. However, there
is little agreement on how benchmarking should be carried out, how to
control for confounding variables, how to analyse latency data, or how to
aid the repeatability of experiments. This free workshop will be a venue
for computer scientists and research software engineers to discuss their
current best practices and future directions.

For further information and free registration please visit:
http://soft-dev.org/events/bench16/

Confirmed Speakers:

Jan Vitek (Northeastern University)
Joe Parker (The Jodrell Laboratory, Royal Botanic Gardens)
Simon Taylor (University of Lancaster)
Tomas Kalibera (Northeastern University)
James Davenport (University of Bath)
Edd Barrett (King's College London)
Jeremy Bennett (Embecosm)

Organizers: Sarah Mount & Laurence Tratt (King's College London)