[Python-ideas] regex module - proper implementation of alternation?
MRAB
python at mrabarnett.plus.com
Tue Jul 16 19:39:54 CEST 2013
On 16/07/2013 17:00, Eli Bendersky wrote:
>
> On Tue, Jul 16, 2013 at 8:53 AM, MRAB <python at mrabarnett.plus.com
> <mailto:python at mrabarnett.plus.com>> wrote:
>
> On 16/07/2013 16:41, Eli Bendersky wrote:
>
> On Tue, Jul 16, 2013 at 8:00 AM, MRAB
> <python at mrabarnett.plus.com <mailto:python at mrabarnett.plus.com>
> <mailto:python at mrabarnett.__plus.com
> <mailto:python at mrabarnett.plus.com>>> wrote:
>
> On 16/07/2013 04:10, Eli Bendersky wrote:
>
> Since the 'regex' module is a candidate for inclusion
> into the
> stdlib, I
> figured this would be a good place to ask.
>
> While discussing something related in pss
> (https://github.com/eliben/__pss__
> <https://github.com/eliben/pss__>), Ben Hoyt brought to my
>
> attention that
> the implementation of alternation (foo|bar) in Python's
> default
> regex
> module (the SRE implementation) is very inefficient.
> And indeed,
> looking
> there it seems that | gets translated to an opcode that
> simply means
> going over all the alternatives in a loop trying to
> match each.
> This is
> not how a proper regex engine should implement alternation!
>
> A common advice given to Python programmers is to
> combine their
> regexes
> into a single one with |. This makes code faster, but
> it turns
> out that
> it's far from its full potential because the
> combination doesn't
> go full
> way to the DFA as it's supposed to.
>
> A question about 'regex' - is it implemented properly
> there?
>
> There are 2 ways of implementing regex: DFA and NFA.
>
> DFA is faster, but those using NFA do so because the
> implementation
> offers additional features that make DFA tricky or
> impossible, such as
> backreferences.
>
> Of course, you could use DFA when it's possible, NFA when
> it isn't, at
> the cost of yet more code.
>
> The regex module uses NFA, just like re.
>
> If you want to improve regex, making it use DFA when
> possible, well,
> the source code is open, and your contributions are
> welcome. Good luck!
> :-)
>
>
> You can use NFA without backtracking, though, by keeping track
> of the
> set of possible states. I believe (but am not 100% sure) this is
> the way
> re2 works, for example.
>
> In the particular case of alternations, such approach is vastly
> superior
> because the "possible states" set never grows large (assuming the
> alternatives are not mostly the same). Whereas with backtracking you
> always have to iterate over all of them.
>
> That said, I think you have answered my question - regex also uses a
> backtracking implementation of NFA and iterates in case of
> alternations.
> OK :-)
>
> Have you tried timing them (re, re2, regex, and possibly others) to
> see whether
> it's a problem in practice?
>
>
> I have not; this page - http://swtch.com/~rsc/regexp/regexp1.html -
> explains the problems with backtracking, but focuses on a different
> aspect (where backtracking leads to exponential behavior).
>
> The problem did, however, come up in practice in pss (see
> https://github.com/eliben/pss/issues/4); there was an attempt to make
> heavier use of regex alternations and the result ended up being much
> slower than one would expect. I had this experience in a different place
> as well (writing regex-based lexers).
>
> There is a performance improvement gained from moving from explicit
> Python-level looping to alternations, but it's a small gain - something
> that makes sense for just moving a loop from Python into C, but not for
> doing something actually smart with the alternation (like looking at
> each incoming character only once).
>
I'd welcome any realistic speed tests that would help me improve the
performance of regex.
More information about the Python-ideas
mailing list