[Python-ideas] Complicate str methods

Franklin? Lee leewangzhong+python at gmail.com
Thu Feb 8 13:23:48 EST 2018


On Thu, Feb 8, 2018 at 5:45 AM, Franklin? Lee
<leewangzhong+python at gmail.com> wrote:
> On Feb 7, 2018 17:28, "Serhiy Storchaka" <storchaka at gmail.com> wrote:
> > The name of complicated str methods is regular expressions. For doing these
> > operations efficiently you need to convert arguments in special optimized
> > form. This is what re.compile() does. If make a compilation on every
> > invocation of a str method, this will add too large overhead and kill
> > performance.
> >
> > Even for simple string search a regular expression can be more efficient
> > than a str method.
> >
> > $ ./python -m timeit -s 'import re; p = re.compile("spam"); s =
> > "spa"*100+"m"' -- 'p.search(s)'
> > 500000 loops, best of 5: 680 nsec per loop
> >
> > $ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
> > 200000 loops, best of 5: 1.09 usec per loop

I ran Serhiy's tests (3.5.2) and got different results.

    # Setup:
    __builtins__.basestring = str  # hack so re2 imports under Python 3
    import re, re2, regex

    n = 10000
    s = "spa"*n+"m"
    p = re.compile("spam")
    pgex = regex.compile("spam")
    p2 = re2.compile("spam")

    # Tests:
    %timeit s.find("spam")
    %timeit p.search(s)
    %timeit pgex.search(s)
    %timeit p2.search(s)

n = 100
find:  350 ns ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
re:    554 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
regex: 633 ns ± 8.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
re2:   1.62 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

n = 1000
find:  2.17 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
re:    3.57 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
regex: 3.46 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
re2:   7.8 µs ± 72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

n = 10000
find:  17.3 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
re:    33.5 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
regex: 31.7 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
re2:   67.5 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusions:
- `.find` is fastest here. On 3.6.1 (Windows), it's about the same speed
as re: 638 ns vs 662 ns at n=100; 41.3 µs vs 43.8 µs at n=10000.
- re and regex have similar performance, probably due to a similar backend.
- re2 is slowest. I suspect the wrapper is the culprit: it may be copying
the strings into a format suitable for the C++ backend.
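The %timeit runs above require IPython. For anyone wanting to reproduce the
`.find` vs re comparison with only the standard library, here is a sketch
using the `timeit` module (the haystack size and loop count are arbitrary
choices, not the exact parameters of the runs above):

```python
import timeit

def bench(stmt, setup, number=10_000):
    """Return the best-of-5 time per call, in microseconds."""
    best = min(timeit.repeat(stmt, setup, repeat=5, number=number))
    return best / number * 1e6

# Worst-case haystack: many near-matches ("spa") before the needle appears.
setup = 'import re; s = "spa" * 1000 + "m"; p = re.compile("spam")'

find_us = bench('s.find("spam")', setup)
re_us = bench('p.search(s)', setup)
print(f"str.find: {find_us:.2f} us/call   re.search: {re_us:.2f} us/call")
```

Taking the minimum of several repeats, as the `timeit` docs suggest, damps
out scheduling noise better than the mean.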

P.S.: I also tested `"spam" in s`, which on 3.5.2 scaled linearly like
`.find` but ran slower. However, `in` is consistently faster than `.find`
in my 3.6, so the discrepancy has likely been fixed.
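For reference, that `in` vs `.find` comparison can be sketched with the
stdlib `timeit` module as well (again, the haystack size and loop count here
are arbitrary choices):

```python
import timeit

setup = 's = "spa" * 1000 + "m"'
number = 20_000
in_best = min(timeit.repeat('"spam" in s', setup, repeat=5, number=number))
find_best = min(timeit.repeat('s.find("spam") >= 0', setup,
                              repeat=5, number=number))

# Sanity check: both expressions answer the same membership question.
s = "spa" * 1000 + "m"
assert ("spam" in s) == (s.find("spam") >= 0)

print(f'"in":  {in_best / number * 1e9:.0f} ns/call')
print(f'.find: {find_best / number * 1e9:.0f} ns/call')
```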

More curious is that, on `.find`, my MSVC-compiled 3.6.1 and 3.5.2 are
about twice as slow as my 3.5.2 under Ubuntu on Windows (WSL), while re
performance is similar. It's probably a compiler difference.
