[Speed] Performance comparison of regular expression engines
Brett Cannon
brett at python.org
Mon Mar 7 12:19:25 EST 2016
Are you thinking about turning all of this into a benchmark for the
benchmark suite?
On Sat, 5 Mar 2016 at 11:15 Serhiy Storchaka <storchaka at gmail.com> wrote:
> I have wrote a benchmark for comparing different regular expression
> engines available in Python. It uses tests and data from [1], that were
> itself inspired by Boost's benchmark [2].
>
> Tested engines are:
>
> * re, standard regular expression module
> * regex, alternative regular expression module [3]
> * re2, Python wrapper for Google's RE2 [4]
> * pcre, Python PCRE bindings [5]
>
> Running tests for all 20MB text file takes too long time, here are results
> (time in millisecons) for 2MB chunk (6000000:8000000):
>
> re regex re2 pcre
> str.find
>
> Twain 5 2.866 2.118 12.47 3.911
> 2.72
> (?i)Twain 10 84.42 4.366 24.76 17.12
> [a-z]shing 165 125 5.466 27.78 180.6
> Huck[a-zA-Z]+|Saw[a-zA-Z]+ 52 57.11 72.16 23.87 234
> \b\w+nn\b 32 239.5 427.6 23.18 251.9
> [a-q][^u-z]{13}x 445 381.8 5.537 5843 224.9
> Tom|Sawyer|Huckleberry|Finn 314 52.73 58.45 24.39 422.5
> (?i)Tom|Sawyer|Huckleberry|Finn 477 445.6 522.1 27.73 415.4
> .{0,2}(Tom|Sawyer|Huckleberry|Finn) 314 451.2 1113 24.38 1497
> .{2,4}(Tom|Sawyer|Huckleberry|Finn) 237 450.1 1000 24.3 1549
> Tom.{10,25}river|river.{10,25}Tom 1 61.55 58.11 24.97 233.8
> [a-zA-Z]+ing 10079 189.4 350.3 47.41 357.6
> \s[a-zA-Z]{0,12}ing\s 7160 115.7 23.65 37.74 237.6
> ([A-Za-z]awyer|[A-Za-z]inn)\s 50 153.7 430.4 27.86 425.3
> ["'][^"']{0,30}[?!\.]["'] 1618 83.12 77.39 26.96 157.6
>
> There is no absolute leader. All engines have its weak spots. For re these
> are case-insensitive search and search a pattern that starts with a set.
>
> pcre is very data-sensitive. For other 2Mb chunk (8000000:10000000)
> results are 1-2 orders slower:
>
> re regex re2 pcre
> str.find
>
> Twain 33 2.671 2.209 16.6 413.6
> 2.75
> (?i)Twain 35 90.21 4.36 27.65 459.4
> [a-z]shing 120 112.7 2.667 30.94 1895
> Huck[a-zA-Z]+|Saw[a-zA-Z]+ 61 57.12 49.9 26.76 1152
> \b\w+nn\b 33 238 401.4 26.93 763.7
> [a-q][^u-z]{13}x 481 387.7 5.694 5915 6979
> Tom|Sawyer|Huckleberry|Finn 845 52.89 59.61 28.42 1.228e+04
> (?i)Tom|Sawyer|Huckleberry|Finn 988 452.3 523.4 32.15 1.426e+04
> .{0,2}(Tom|Sawyer|Huckleberry|Finn) 845 421.1 1105 29.01 1.343e+04
> .{2,4}(Tom|Sawyer|Huckleberry|Finn) 625 398.6 985.6 29.19 9878
> Tom.{10,25}river|river.{10,25}Tom 1 61.6 58.33 26.59 254.1
> [a-zA-Z]+ing 10109 194.5 349.7 50.85 1.445e+05
> \s[a-zA-Z]{0,12}ing\s 7286 120.1 23.73 42.04 1.051e+05
> ([A-Za-z]awyer|[A-Za-z]inn)\s 43 170.6 402.9 30.84 1119
> ["'][^"']{0,30}[?!\.]["'] 1686 86.5 110.2 30.62 2.369e+04
>
> [1] http://sljit.sourceforge.net/regex_perf.html
> [2]
> http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html
> [3] https://pypi.python.org/pypi/regex/2016.03.02
> [4] https://pypi.python.org/pypi/re2/0.2.22
> [5] https://pypi.python.org/pypi/python-pcre/0.7
> _______________________________________________
> Speed mailing list
> Speed at python.org
> https://mail.python.org/mailman/listinfo/speed
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/speed/attachments/20160307/07b20d86/attachment.html>
More information about the Speed
mailing list